ABSTRACT

A multimedia extension unit (MEU) is provided for performing various multimedia-type operations. The MEU can be coupled either through a coprocessor bus or a local CPU bus to a conventional processor. The MEU employs vector registers, a vector ALU, and an operand routing unit (ORU) to perform a maximum number of the multimedia operations within as few instruction cycles as possible. Complex algorithms are readily performed by arranging operands upon the vector ALU in accordance with the desired algorithm flowgraph. The ORU aligns the operands within partitioned slots or sub-slots of the vector registers using vector instructions unique to the MEU. At the output of the ORU, operand pairs from vector source or destination registers can be easily routed and combined at the vector ALU. The vector instructions employ special load/store instructions in combination with numerous operational instructions to carry out concurrent multimedia operations on the aligned operands.

31 Claims, 11 Drawing Sheets 






11/14/2003, EAST Version: 1.4.1 



US 6,173,366 B1



U.S. PATENT DOCUMENTS

4,783,736 * 11/1988 Ziegler et al. 711/130
4,884,197 * 11/1989 Sachs et al. 711/123
4,891,754 1/1990 Boreland
5,025,407 6/1991 Gulley et al.
5,193,167 3/1993 Sites et al.
5,307,300 4/1994 Komoto et al.
5,335,330 * 8/1994 Inoue 712/241
5,437,043 7/1995 Fujii et al.
5,481,713 * 1/1996 Wetmore et al. 395/705
5,513,366 * 4/1996 Agarwal et al. 712/22
5,627,981 5/1997 Adler et al.
5,640,588 6/1997 Vegesna et al.
5,669,013 9/1997 Watanabe et al.
5,801,975 9/1998 Thayer et al.
5,845,083 * 12/1998 Hamadani et al. 709/231
5,893,145 4/1999 Thayer et al.
5,909,572 6/1999 Thayer et al.



OTHER PUBLICATIONS

Kohn, L., et al., "The Visual Instruction Set (VIS) in UltraSPARC," SPARC Technology Business, Sun Microsystems, Inc., 1996 IEEE, pp. 462-469.

Gwennap, Linley, "UltraSparc Adds Multimedia Instructions: Other New Instructions Handle Unaligned and Little-Endian Data," Microprocessor Report, Dec. 5, 1994, pp. 16-18.

Lee, Ruby B., "Realtime MPEG Video via Software Decompression on a PA-RISC Processor," Hewlett-Packard Company, 1995 IEEE, pp. 186-192.

Mattison, Phillip E., "Practical Digital Video With Programming Examples in C," Wiley Professional Computing, pp. 158-178.

Zhou, Chang-Guo, et al., "MPEG Video Decoding With the UltraSPARC Visual Instruction Set," Sun Microsystems, Inc., 1995 IEEE, pp. 470-474.

* cited by examiner






U.S. Patent Jan. 9, 2001 Sheet 1 of 11 US 6,173,366 B1

FIG. 1 (PRIOR ART): block diagram. Labels: original image, image encoder 12, compressed image, additional optional processing 14, modulator 16, channel or storage 18, demodulator 20, additional optional inverse processing 22, compressed image, image decoder 24, reconstructed image.

FIG. 2 (PRIOR ART): block diagram. Labels: original image (RGB or YCrCb), 32, 30, DCT 36, 26, intra/inter classifier 50, motion estimation 48, quantization 38, entropy encoding 40, inverse quantization 42, inverse DCT 44, reference memory 46, buffer 34, compressed image.






FIG. 5: block diagram. Labels: µP 92, coprocessor bus, MEU 90, CPU bus, interrupt controller 96, bus bridge 98, I/O bus, I/O devices 102a and 102b, main memory 104; reference numeral 94.

FIG. 6: block diagram of 110a. Labels: MEU 90 with vector regs. 128, ORU 124 and vector ALU 116; instruction cache 112, decode 114, int. ALU 118, int. regs. 126, bus interface 122, data cache 120, CPU bus.



FIG. 7: block diagram of 110b. Labels: int. registers 126, reorder buffer 131, execute units 132a through 132n, MEU 90, load/store 134, bus interface 122, data cache 120, CPU bus.

FIG. 8: block diagram of MEU 90. Labels: vector registers v0, v1, v2 and v3, operand router unit (ORU) 124, vector ALU (VALU) 116, to store unit, from load unit; path widths marked 128, 160x4 and 160x2.



FIG. 9: block diagram. Labels: source B register (slot 0 through slot 7), slot S mux 125a, 125b through 125n, constants 0, 1.0 and -1.0, 124, decoded slot S selection field, slot S partition of VALU 117a through 117n, 116, decoded operation code, destination register (slot S).

FIG. 10: block diagram. Labels: source B register, source A register, slot S mux, constants 0, 1.0 and -1.0, 124, decoded slot S selection field, slot S partition of VALU, 116, operation code, destination register.



FIG. 11: block diagram. Labels: source B register, destination register, slot S mux, constants 0, 1.0 and -1.0, 124, slot S partition of VALU, 116, destination register (slot S).

FIG. 12: block diagram. Labels: source A register (slot S+1), slot S+1 partition of VALU, 116, destination register (slot S).



FIG. 13: diagram. Labels: source B register, source A register, destination register, sub-slot markings UH SS and LH SS; reference numerals 140, 142, 144, 146.

FIG. 14: diagram of load and store formats. Labels: 10-bit load 148 and 10-bit store 150 (8-bit unsigned value in memory, 10-bit signed value in register partition, 8-bit unsigned value in memory); 20-bit load and 20-bit store (16-bit unsigned value in memory, 20-bit signed value in register partition, 16-bit unsigned value in memory); reference numerals 152, 154.






FIG. 19: worked multiply-accumulate example. Operation: dest <== dest + sourceA*sourceB, with -0.7637 <== -0.500 + (-0.3203 * 0.8262). Labels: 10-bit SourceA = -0.3203 (170), 10-bit SourceB = 0.8262 (172), 19-bit intermediate product = -0.26463318 (174), round and drop LS bits, 10-bit intermediate product = -0.2637, 10-bit destination = -0.500, add and clip, 10-bit final result = -0.7637 (180); further reference numerals 176, 178.
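
The arithmetic of FIG. 19 (dest <== dest + sourceA*sourceB) can be checked with a short fixed-point sketch. It is illustrative only and is not taken from the patent. It assumes that a 10-bit operand holds one sign bit and nine fractional bits, which is consistent with the figure's values (-0.3203 is close to -164/512 and 0.8262 is close to 423/512). It also assumes round-half-up for the "round and drop LS bits" step, since the figure does not name a rounding mode. The names to_fixed, to_float and multiply_accumulate are hypothetical.

```python
# Sketch of the FIG. 19 operation: dest <== dest + sourceA * sourceB.
# Assumption: a 10-bit operand is 1 sign bit + 9 fractional bits, so the
# stored integer v represents the value v / 512.

FRAC_BITS = 9
SCALE = 1 << FRAC_BITS              # 512
MIN_10BIT, MAX_10BIT = -512, 511    # range of a signed 10-bit partition


def to_fixed(x: float) -> int:
    """Quantize a real value in [-1, 1) to the assumed 10-bit format."""
    return max(MIN_10BIT, min(MAX_10BIT, round(x * SCALE)))


def to_float(v: int) -> float:
    """Convert a stored 10-bit integer back to the value it represents."""
    return v / SCALE


def multiply_accumulate(dest: int, src_a: int, src_b: int) -> int:
    """Return dest + src_a * src_b, rounding the product and clipping the sum."""
    product = src_a * src_b                         # carries 18 fractional bits
    # Round and drop the low 9 bits (round-half-up is an assumption).
    rounded = (product + (1 << (FRAC_BITS - 1))) >> FRAC_BITS
    # Add and clip to the 10-bit range.
    return max(MIN_10BIT, min(MAX_10BIT, dest + rounded))


a = to_fixed(-0.3203)   # stored as -164
b = to_fixed(0.8262)    # stored as 423
d = to_fixed(-0.500)    # stored as -256

result = multiply_accumulate(d, a, b)
print(a * b / (SCALE * SCALE))      # full-precision product, about -0.26463318
print(round(to_float(result), 4))   # -0.7637
```

Under these assumptions the full-precision product, the rounded product (-135/512, about -0.2637) and the final result (-391/512, about -0.7637) agree with the values printed in the figure.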




LOAD AND STORE INSTRUCTIONS WHICH PERFORM UNPACKING AND PACKING OF DATA BITS IN SEPARATE VECTOR AND INTEGER CACHE STORAGE

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to digital signal processing (DSP), and more particularly to an extension unit added to a microprocessor for high speed multimedia applications. The extension unit includes an operand routing unit which aligns multiple operands upon an arithmetic logic unit (ALU) in response to specific multimedia-type instructions. Proper ordered arrangement of operands at the ALU enhances the throughput of many image compression algorithms which rely upon repetitive, sequential operations.

2. Description of the Relevant Art

It is well known that conventional computers communicate information primarily through a graphical user interface (GUI). The GUI involves manipulation of complex graphical images, as either still graphic images or full motion video. Current software has spawned numerous multimedia applications which require administering still images or video via the GUI.

Processing still images or video consumes prodigious amounts of storage space within the computer. For example, a 256 color VGA screen image can entail numerous rows and pixels, each consuming a single byte of store. For example, a partial screen containing 200 rows of 320 pixels consumes a minimum of 64K bytes of storage. Real time processing of still images (and especially video) thereby requires that the amount of data be reduced. The task of reducing the amount of data necessary to store or transmit one or more digital images is often referred to as "image compression".

Image compression can be classified as either lossy or lossless. If the reconstructed image is not identical to the original image, the compression is said to be lossy. Lossy compression is used where the reconstructed image, while not identical to the original image, nonetheless conveys the essential features of the image. Minor changes may not be perceptible to a human observer, or may not be objectionable for a particular application. Lossy compression can therefore reduce the amount of data relative to lossless compression but without perceptible defects.

FIG. 1 illustrates a conventional lossy image compression system 10. System 10 is shown applicable to image (i.e., still image or full motion image) compression and decompression. An original image is compressed by an image encoder 12, and the encoded output may be further processed in block 14 using, for example, error correction, encryption, multiplexing, etc. The compressed image can be stored or sent through a communications channel. If forwarded through a communications channel, the compressed data is modulated upon a carrier signal by modulator 16. The data-modulated carrier signal is then forwarded to a decoder via channel 18. If the data is transmitted and requires demodulation, block 20 is used to extract the compressed image which can then be further processed as needed by block 22. Block 22 is used to perform, for example, decryption, demultiplexing, etc. Decoder 24 receives the compressed image having redundant or irrelevant data removed, and thereafter produces a reconstructed image perceptibly similar to the original image.

FIG. 2 illustrates, in further detail, an image encoder 12 used for compressing an image as either a still image or a sequence of images (i.e., full motion video). Upon receiving the image in either RGB or YCrCb format, encoder 12 encodes certain "frames" of a plurality of frames within the sequence of motion images or still images. Frames within a video sequence can be compressed using numerous compression standards, a popular one being the Moving Pictures Experts Group (MPEG) standard. MPEG compression involves discerning intracoded frames from non-intracoded frames. An intracoded frame, often called an I-frame, is compressed relative to itself, while non-intracoded frames, often called P-frames and B-frames, are encoded by exploiting temporal redundancy as well as spatial redundancy to reduce the number of bits required for encoding.

Encoding and decoding video presents many challenges to realizing an efficient MPEG compression standard. The intracoded frames are stored, generally in a moderately compressed format. Successive non-intracoded frames are compared with the intracoded frames and the differences are stored. Periodically, such as when a new scene is displayed, a new intracoded frame is stored, and subsequent comparisons begin from this new reference point.

Video compression standards such as MPEG, DVI and Indeo all use the intracoded frame technique. Many compression standards such as MPEG treat various frames within the frame sequence as a still image and apply still image compression to those frames. A popular still image compression standard is Joint Photographic Experts Group (JPEG). Encoder 12 illustrates numerous blocks used in MPEG video compression, of which a portion of those blocks are pertinent to, e.g., JPEG. The JPEG portion of encoder 12 is shown within dashed area 26. Functional blocks within dashed area 26 serve to compress pixel data within blocks of each macro block from the original frame or image. The compressed digital data is then forwarded into an embedded decoder 30. Embedded decoder 30 is placed in a feedback arrangement, wherein the output of decoder 30 is subtracted from the original frame. Subtraction is shown at block 32, and the output from functional blocks 26 is shown fed into a buffer 34 for subsequent output as compressed intracoded and non-intracoded frames.

In order to avoid having to store or transmit large amounts of information on each pixel within each frame, MPEG reduces the data to that which is pertinent only to intracoded and non-intracoded frames. As seen in the feedback arrangement of FIG. 2, data manipulation must be performed as rapidly as possible on each macro block or frame, preferably in real time. Substantial data reduction (lossy compression) is needed on frames of interest and generally occurs in JPEG blocks 26 and, more specifically, during quantization.

JPEG generally employs three stages of compression. A first stage utilizes a discrete cosine transform (DCT) function 36. DCT is a class of mathematical operations which take a signal and transform it from one type of representation to another. Specifically, DCT converts an array of numbers, which represent signal amplitudes at various points in time and space, into another array of numbers, each of which represents the amplitude of a certain frequency component from the original signal. The resulting array of numbers contains the same number of values as the original array. Using a JPEG format, DCT transform is performed on a block of 8x8 picture elements (or "pels") taken from an original image.

Output from DCT 36 is fed to a quantizer 38. Quantization 38 involves the lossy stage of data compression by reducing the number of bits needed to store an integer value of lessened precision. A quantization matrix, chosen by a code
word, reduces the matrix values output from DCT to the indices for the code words. Upon decode, the images are reconstructed using a table look-up procedure, given the code word selected by the quantization algorithm. The International Standards Organization (ISO) maintains the quantization code words used by implementers of JPEG code. The quantization matrix can be coded in block 40 using several methods. For example, the quantized images of each frame can be arranged in a zig-zag sequence. The zig-zag sequence is then coded using run-length encoding (RLE) followed by entropy coding (which includes the popular Huffman code).

Code output from block 40 is a variable length code which generally represents smaller decimal numbers, and can be represented with corresponding smaller number of bits depending upon the decimal value. An advantage of using smaller number variable length coding is carried forth within the intracoded and non-intracoded sequence of frames, or more particularly within each macro block of a frame.

Accordingly, MPEG involves JPEG-type compression on each selected frame macro block, coupled with frame-by-frame compression using motion estimation, motion compensation and frame classification. Motion estimation, motion compensation and frame classification are relevant on only decoded pertinent frames which are produced as part of the feedback loop within inverse quantization 42 and inverse DCT 44. After undergoing inverse quantization and inverse DCT, the resulting frames are stored in reference memory 46 where they can thereafter be drawn together and placed within motion estimation block 48. Motion estimation block 48, in combination with intracoded and non-intracoded (i.e., intra/inter) frame classifier block 50, form the motion estimation/compensation portion of MPEG. Motion compensation is defined as a process of compensating for displacement of moving objects from one frame to another, and motion estimation is the process of estimating location of corresponding pels with the frames. For each block in the current P-frame, the block in the referenced frame (i.e., I-frame) which matches it best is identified by a motion vector. The differences, undertaken by subtraction block 32, between the pixel values in the matching block in the reference frame and the current block in the current frame are then transformed, quantized and coded by blocks 26.

Blocks 26 used for JPEG functionality, and the various blocks 42-50 used for MPEG decoding, feedback, motion estimation/compensation, and frame classification are generally well documented in the field of image compression. References to many of the blocks shown in FIG. 2 are set forth in numerous disclosures, an exemplary disclosure being Bhaskaran, et al. "Image Compression Standards and Architectures", ACM Multimedia 94, October, 1994, (herein incorporated by reference).

Transformation of a picture element to a DCT output, as well as quantization and coding of that output, requires algorithms unique to multimedia applications. Performing decoding (inverse quantization and inverse DCT) as well as motion estimation and compensation also require operation-intensive algorithms. Those operations can generally be classified as add, multiply, subtract, shift and accumulate operations, each of which must be performed as quickly as possible in order to make JPEG and MPEG a viable compression standard. Dedicated digital signal processors (DSPs) are generally used to carry out those operations in an expeditious manner. DSPs are often included within multimedia devices such as sound cards, speech recognition cards, video capture cards, etc. DSPs function as coprocessors, performing complex and repetitive mathematical computations demanded by the data compression algorithms. DSPs perform specific multimedia-type algorithms more efficiently than general purpose microprocessors.

There are numerous types of DSPs which can perform JPEG and MPEG data compression. For example, Hewlett Packard Corp. PA-7100LC microprocessor functions not only as a general purpose processor, but also as a DSP with generic multimedia-type instructions added to increase data compression throughput. Compression throughput of the PA-7100LC is primarily limited by the execution time involved in performing DCT or inverse DCT (IDCT). See, e.g., Lee, "Realtime MPEG Video via Software Decompression on a PA-RISC Processor", IEEE, 1995, pp. 186-192 (herein incorporated by reference). Sun Microsystems, Inc. has also devised a multimedia-type instruction set labeled Visual Instruction Set (VIS) which is designed to run on the UltraSPARC™ processor. See, e.g., Kohn, et al., "The Visual Instruction Set (VIS) in UltraSPARC™", IEEE, 1995, pp. 462-469 (herein incorporated by reference); and, Chang-Guo Zhou, "MPEG Video Decoding With The UltraSPARC™ Visual Instruction Set", IEEE, 1995, pp. 470-474 (herein incorporated by reference). Similar to the dedicated multimedia instruction set used by the PA-7100LC, maximum efficiency of a VIS is limited to a particular multimedia application. For example, the optimized instruction set may be efficient in performing fast Fourier transforms (FFT), motion estimation or Huffman encoding, but may be lacking in other areas, such as the critical operation-intensive IDCT area. Further, while current multimedia instructions offer a fixed performance increase as to existing algorithms, they unfortunately do not always provide scalability to different types of algorithms or specific algorithms which change over time. As the new standards for JPEG, MPEG, DVI, Indeo and H.320 arrive, new algorithms may be needed where scalability to those operations is critical in achieving viable, real-time compression.

DCT and IDCT form a substantial part of an encode and/or decode algorithm, and certainly contribute numerous operations to data compression. As shown in FIG. 2, DCT and IDCT comprise prevalent portions of an encoder. For an 8x8 block of pixel elements, the DCT transform is generally represented as follows:

\[ \mathrm{DCT}(i,j) = \frac{1}{\sqrt{2 \cdot 8}}\, C(i)\, C(j) \sum_{x=0}^{7} \sum_{y=0}^{7} \mathrm{pixel}(x,y)\, \cos\!\left[\frac{(2x+1)\, i\, \pi}{2 \cdot 8}\right] \cos\!\left[\frac{(2y+1)\, j\, \pi}{2 \cdot 8}\right] \qquad \text{(Eq. 1)} \]

\[ C(x) = 1/\sqrt{2} \ \text{if } x \text{ is } 0, \ \text{else } 1 \ \text{if } x > 0 \]

Equation 1 indicates numerous multiply, add (or subtract), shift, and accumulate operations needed to carry out DCT. According to the article by Bhaskaran, several thousand multiply and add operations are necessary to perform the operations in equation 1. While faster algorithms reduce the operation count, the number of operations still remains daunting when performed on conventional DSPs. Even DSPs which have specialized multiply, add/subtract and accumulate multimedia-type instruction sets still require numerous instruction cycles in order to complete DCT on a matrix of numbers.

IDCT is carried out not only in an embedded decoder 30 of encoder 12 (shown in FIG. 2), but also in the decoder 24 shown in FIG. 3 at the receiving end of a storage unit or
channel. Decoder 24 is shown for illustrative purposes as an MPEG decoder, comprising functional blocks 56-66 which essentially reverse the steps taken by an MPEG compression encoder. Decoder 24 decodes the MPEG header, which provides information regarding the block, macro block, and frame or sequence of frames which follow the header. The variable length encoded pels which follow the header are decoded into fixed length numbers by variable length decoding block 56. A reverse order scan of blocks and macro blocks across the frame, and from frame-to-frame, is performed at block 58. Next, inverse quantization 60 is applied to the inverse scanned numbers to restore them to the original range. Then, an IDCT computation 62 is performed on the blocks in each frame. IDCT converts the frequency domain back to the original spatial domain, and provides the actual pixel values for I-blocks, but only the differences for each pixel for P-blocks and B-blocks. Next, motion compensation is performed for P-blocks and B-blocks. The differences calculated in the IDCT computation are added to the pixels in the reference block as determined by the motion vector, for P-blocks, and to the average of the forward and backward reference blocks, for B-blocks. Motion compensation is shown by reference numeral 64. Memory 66 is periodically updated at each frame within a plurality of frames which represent a reconstructed image.

Regardless of the data compression standard used, encode and decode operations employ lengthy computations, and a substantial number of those computations involve DCT or IDCT operations. Similar to DCT transform, IDCT requires a careful selection of operations sequentially applied as multiply, add, subtract, shift and accumulate operations. An IDCT transform function for an 8x8 matrix can be shown as follows:

\[ \mathrm{Pixel}(x,y) = \frac{1}{\sqrt{2 \cdot 8}} \sum_{i=0}^{7} \sum_{j=0}^{7} C(i)\, C(j)\, \mathrm{DCT}(i,j)\, \cos\!\left[\frac{(2x+1)\, i\, \pi}{2 \cdot 8}\right] \cos\!\left[\frac{(2y+1)\, j\, \pi}{2 \cdot 8}\right] \qquad \text{(Eq. 2)} \]

\[ C(x) = 1/\sqrt{2} \ \text{if } x \text{ is } 0, \ \text{else } 1 \ \text{if } x > 0 \]

There is no theoretical or mathematical limit on the size of the input array for an IDCT computation. Equation 2 would be the same for transforming an entire image, although the computation time required for that large an array would be prohibitive. As set forth in Mattison, Practical Digital Video With Programming Examples In C (John Wiley & Sons, 1994) pp. 158-178 (herein incorporated by reference), the number of multiplication operations required for each element of a one dimensional DCT matrix is proportional to the square of the number of elements in the sample array. Accordingly, reducing the array size from a two-dimensional array to a one-dimensional array (e.g., to a 1x8 array) serves to reduce the number of overall computations for each array. The following equation illustrates an IDCT transform function for converting a 1x8 matrix of elements to a 1x8 column of pixels:

\[ \mathrm{Pixel}(m) = \sqrt{2/8}\, \sum_{j=0}^{7} \mathrm{DCT}(j)\, c(j)\, \cos\!\left[\frac{(2m+1)\, j\, \pi}{2 \cdot 8}\right] \qquad \text{(Eq. 3)} \]

\[ c(j) = 1/\sqrt{2} \ \text{when } j = 0, \ \text{else } 1 \ \text{if } j > 0 \]

Dividing the original image into one-dimensional smaller blocks helps reduce the number of computations on each array from over several thousand to a more manageable number, e.g., 16 multiplications and 26 additions (or subtractions/accumulations). See, e.g., Bhaskaran, "Image Compression Standards and Architectures", pp. 10-12.

It is desirable to introduce a DSP which can optimally perform multimedia-type operations in a rapid manner, at or near real time. The multimedia operations would benefit from being executed upon a DSP formed as part of an existing processor similar to conventional designs but without the scalability limitation. Thus, the desired DSP must be capable of performing current or future-derived mathematical computations using not only an enhanced multimedia-type instruction set but also using enhancements to existing hardware. An improved DSP is thereby needed which functions as a hardware and software extension to an existing processor core. Responsive to multimedia instructions, a DSP is needed which allows routing of operands to an arithmetic logic unit (ALU) in accordance with present or future-desired algorithms. An improved DSP is needed which can route multiple operands (i.e., more than two operands) simultaneously from partitioned, non-integer registers to the ALU depending upon any algorithm which might be chosen. The improved DSP must be capable of functioning on algorithms unique to JPEG, MPEG, DVI, Indeo, H.320 and, more specifically, on any future algorithm which requires multiple operations carried out in a structured sequence of simultaneous operations. A popular algorithm to which such a DSP would be particularly useful is one involving IDCT.

Enhancements to existing processors or to existing instruction sets are thereby needed to make MPEG, JPEG, H.320, etc., more viable as data compression standards. It would be desirable to perform as many operation-intensive computations as possible in parallel, and within as few instruction cycles as possible. It would also be beneficial to reorder operands such that operands exist in optimal order for such processing. Each operand within a set of operands must be chosen from one of numerous locations within a non-integer register. Reading from and writing to non-integer registers would avoid bandwidth limitations on existing integer registers, while allowing access to integer registers simultaneous with the multimedia-dedicated (non-integer) registers.

SUMMARY OF THE INVENTION

The problems outlined above are in large part solved by a multimedia extension unit (MEU) of the present invention. The MEU hereof embodies hardware components, and software instructions which optimally operate those components. The MEU is added to an existing processor to more efficiently perform multimedia-type operations. Thus, the MEU functions as a DSP, but more specifically as a high performance DSP necessary for achieving real time data compression. The MEU can perform multiple operations within a single instruction cycle and therefore is particularly useful in performing repetitive, sequential operations found in MPEG, JPEG, DVI, Indeo and H.320 compression systems.

The ability to perform multiple operations is contingent upon aligning numerous operands at select moments upon particular partitions within a partitioned arithmetic logic unit (ALU). Thus, the MEU embodies hardware components which can arrange operands in response to specific operand routing instructions. To carry out operand routing, the MEU employs, inter alia, three components: a partitioned ALU, an operand routing unit (ORU), and a series of partitioned
registers. The ALU is partitioned into vectors, each of which can perform a separate and independent operation from the other vectors. The ORU aligns a series of operands on respective partitions of the vector ALU so that an operation can be performed on each operand concurrent with operations on other operands. The operands are provided from registers. Each register is partitioned into a series of vectors, classified as either slots or sub-slots. Each slot or sub-slot contains sufficient bit locations to store an operand.

Operand routing is responsive to numerous multimedia-type instructions unique to the MEU. The multimedia-type instructions, or "vector instructions", are decoded to the ORU as well as to each vector of the ALU. Those instructions not only ensure that multiple operands are properly aligned to respective partitions of the ALU, but also serve to perform various data compression algorithms in a more efficient manner. For example, a dedicated accumulate/merge instruction proves useful during the latter stages of an algorithm when intermediate results are being merged into a final result. An accumulate/merge instruction allows a final operation (i.e., add, subtract, multiply, etc.) to occur on an intermediate result concurrent with a merging into the final result. Without an accumulate/merge instruction, repetitive move instructions are needed which cannot be performed in parallel with other useful instructions. As another example, a source partition shift instruction is used to quickly move data within one slot of a source register to an incremental next slot within a destination register. A single source partitioned shift instruction can thereby move data between slots while simultaneously moving in new slot information. Moving data between slots proves useful in performing serial operations such as those found in FIR filters. As yet another example, logic shift and arithmetic scaling instructions are used to readily perform pixel format conversions. By having the capability of shifting up to four bits in either direction, the shifting operation can easily convert between

registers"). An MEU operation does not involve continual scheduling by the core CPU and, accordingly, the vector registers can operate concurrent with the integer registers handling addressing calculations and program flow control.

Depending upon the complexity of the ORU, and the number of partitions within the vector registers and ALU, the MEU can achieve a varying degree of performance which is scaleable to any intended DSP application without constraining the CPU or the CPU bandwidth. The scaleable architecture to either 8/10 or 16/20 bit data types helps prevent limitations to current data compression algorithms and provides for future applications, such as revisions or upgrades to the MPEG and JPEG standards. While ORU and instruction decoding operates on slot boundaries, overall flexibility is gained by serving operands which are sized to either a slot or a sub-slot. Thus, depending upon the performance desired, either 10-bit or 20-bit operands can be aligned and therefore either m or 2m operations can be performed in a single instruction cycle. For example, in motion estimation or during logic shifting, numerous 10-bit rather than fewer 20-bit values may be needed. This is certainly the case when numerous, low-precision operations must be carried out in rapid succession.

The more scaleable or flexible MEU architecture is therefore attuned to almost any desired algorithm. According to one embodiment, the MEU can accommodate up to sixteen operands within a single instruction cycle. Given reasonable die constraints, the MEU can align those operands for discrete, concurrent operations. For example, the MEU can perform all IDCT transform operations on a 1x8 set of values within only six instruction cycles.

Broadly speaking, the present invention contemplates a system for routing operands to an ALU. The ALU is partitioned, and the system comprises a first register and a second register, denoted as vector registers. The vector registers are contained in the MEU and are partitioned into

pixel color formats, such as unpacking from low-color a plurality of slots and sub-slots. As part of the operand 

texture information in memory to higher-color display pix- routing hardware, a multiplexer is coupled to convey to the 

els. For example, 8-bit values can be readily unpacked to ALU an operand within any of the plurality slots of the 

16-bit data during a load operation, followed by multiple 40 second register. Thus, operands within the second register 

shifting of that loaded data exclusively within the MEU. are reordered in accordance with operands within the first 

Vector instructions allow loading of either an 8-bit byte or register. Operands within the second register and first reg- 

a 16-bit word into a 10-bit register subplot or a 20-bit ister are appropriately paired and simultaneously conveyed 

register slot. Accordingly, the MEU supports either 8/10 bit to separate partitions within the ALU. 

or 16/20 bit data types. Loading the slots/sub-slots from a 45 The present invention further contemplates a computer, 

memory location expands the data width by 25 percent. The computer comprises an input/output device operably 

Expanding the data width increases the precision of the DSP coupled to a microprocessor. The microprocessor includes 

operations. Adding two or four bits to a value increases the an instruction cache adapted for storing coded first and 

number of digits to the right of a fixed point number second sets of instructions. The first set of instructions 

resulting from, for example, an addition, subtraction or 50 comprises integer instructions, and a second set of instruc- 

multiplication operation. The added precision proves valu- tions comprises non-integer instructions, or vector instruc- 

able especially when performing an accumulation operation. tions. A decode unit is used for decoding and routing the 

A store instruction is opposite a load instruction, and per- vector instructions to a plurality of vector registers, an 

forms truncation on two or four of the least significant bits. operand routing unit (ORU) and a vector ALU. The vector 

Truncation is generally not a problem since most interme- 55 registers are useable for storing floating point numbers, but 

diate results are stored within the expanded bit locations of adapted for storing fixed point data values. The fixed point 

the registers. Generally, it is only after the operations are data values are periodically drawn upon by the ORU and the 

completed, and not during the interim, that the result is ALU. The ORU is responsive to a vector instruction for 

stored in truncated form. rearranging operands forwarded from a second register. The 

The present MEU can use pre-existing registers of an x86 60 operands are arranged so that each operand from the second 

floating point unit (FPU). Instead of storing floating point register is paired with an operand from the first register. The 

values, the vector instructions treat the registers as contain- pairing is chosen for achieving as many concurrent opera- 

ing fixed-point data values. The registers are partitioned into lions as possible. During each instruction cycle, an operation 

slots and sub-slots, and are thereby referred to as vector can be performed for each pair of sub-slots or for each pair 

registers containing data values involved with DSP calcu- 65 of slots, depending upon the amount of scalability desired, 

lations. The vector registers can be accessed concurrently The present invention further contemplates an MEU 

with the x86 CPU registers (generally referred to as "integer capable of executing two distinct sets of operations within a 
single instruction cycle. The MEU includes first and second vector registers and a vector ALU. The vector ALU is partitioned into a first logic portion and a second logic portion, wherein the first logic portion is operably coupled to receive a first operand within one of the slots of the first register and a second operand within one of the slots of the second register. The second logic portion is operably coupled to receive a third operand within one of the slots of the first register and a fourth operand within one of the slots of the second register. The first and second logic portions perform arithmetic operations concurrently; however, the arithmetic operation performed by the first logic portion may be dissimilar from the operation performed by the second logic portion. For example, an operation can be performed on operands in each slot, and if eight slots are present, then four add operations can be performed simultaneously with four subtract operations. Thus, the first logic portion can perform an add upon the first and second operands while the second logic portion can perform a subtract on the third and fourth operands. Of course, there are more than four operands which can be forwarded to the ALU and, specifically, there are s operands sent from one source register while another s operands are sent from another source register (or from a destination register). Accordingly, operands forwarded to the ALU can arise from either source registers or from a destination register, the destination register being a register for storing intermediate results of operations upon the source registers.

The present invention yet further contemplates instructions which can load or store data to vector registers or memory locations, respectively. The system comprises a first memory element partitioned into a plurality of n bit slots, each slot of which is further partitioned into a pair of n/2 bit sub-slots. According to one embodiment, the first memory element is a vector register. The system further comprises a second memory element partitioned into a plurality of n/2 bit memory locations. According to one embodiment, the second memory element is a semiconductor memory. A data bus is connected between the first and second memory elements for transferring a plurality of operands between the plurality of slots (or sub-slots) and memory locations. According to one embodiment, the data bus can be connected to load operands in successive memory locations to successive pairs of sub-slots. According to another embodiment, the data bus can load zero values to a first set of slots within a plurality of slots while loading operands into a plurality of slots subsequent to the first set of slots. According to yet another embodiment, the data bus can load operands in a successive pair of the memory locations to a pair of sub-slots arranged in successive but dissimilar slots. According to yet another embodiment, the data bus can load every other successive memory location of the plurality of memory locations to sub-slots within the successive plurality of slots; the other sub-slots of the plurality of slots are loaded with immediate zero values. Converse to the loading operation, the data bus can also perform the above embodiments for various store instructions.

The present invention still further contemplates a vector instruction for swapping operands between sub-slots within one or more slots. More specifically, the swapping instruction can be used to exchange operands within sub-slots of one source register (or a pair of source registers) to sub-slots of a destination register. Thus, 10 bits of a 20 bit operand can be routed across an upper/lower sub-slot boundary to rearrange the upper and lower half bit locations. For example, upper and lower halves of a 20-bit operand within a slot of one register can be exchanged and placed in a slot of a destination register. Further, upper and lower halves of a slot within respective dissimilar source registers can be concatenated in various swapping arrangements to a slot within a destination register, all of which would be beneficial in routing smaller bit operands or routing larger bit operands across the mid-slot barrier. Thus, the ORU routes slots, while software instructions can route sub-slots within each slot.

The present invention yet further contemplates vector instructions such as conditional move and accumulate instructions. A conditional move instruction is particularly useful when mapping data from a source register to a destination register depending upon the value of another source register. Conditional moves are often employed when moving a pixel value presented as an operand from one location in a frame to another depending upon the condition of another pixel within that frame or another frame. Conditional moves are needed when performing, for example, motion estimation and compensation. In addition to conditional moves, accumulate operations are also beneficial. For example, accumulate is needed when performing any type of running accumulation of arithmetic values. Vector instructions, such as conditional move and accumulate, enhance DSP but, more specifically, do so while avoiding unnecessary operations.

Arithmetic scaling, which is lacking from many conventional operations, is readily performed as part of the present load/store instructions. For example, packing and unpacking instructions found in many DSP instruction sets can be avoided. Unpacking of an 8-bit word into a 20-bit slot occurs as part of a load instruction, whereas packing of a 20-bit word to an 8-bit word occurs as part of the store instruction. Combining packing and unpacking operations with store and load helps eliminate unnecessary move instructions which occur as part of stand-alone conventional pack and unpack instructions.

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which:

FIG. 1 is a block diagram of an image compression system;

FIG. 2 is a block diagram of an image encoder;

FIG. 3 is a block diagram of an image decoder;

FIG. 4 is a flowgraph of a 1x8 IDCT algorithm performed on operands according to an operand routing technique of the present invention;

FIG. 5 is a block diagram of a computer system embodying a CPU linked, according to one embodiment, by a coprocessor bus to an MEU of the present invention;

FIG. 6 is a block diagram of a scalar microprocessor having a decode unit for concurrently dispatching integer and non-integer instructions to an integer execution unit and an MEU of the present invention;

FIG. 7 is a block diagram of a superscalar microprocessor having a decode unit for concurrently dispatching multiple integer instructions along with non-integer instructions to respective integer and MEU execution units of the present invention;

FIG. 8 is a block diagram of an MEU having a series of vector registers, an ORU, and a vector ALU according to the present invention;

FIG. 9 is a block diagram of operands within a source B vector register routed by the ORU and operated upon by the vector ALU;
FIG. 10 is a block diagram of an operand undergoing a conditional move operation from source B register to destination register;

FIG. 11 is a block diagram of an operand undergoing an accumulate/merge operation, wherein the destination register provides an input to the vector ALU;

FIG. 12 is a block diagram of an operand undergoing a copy from partition s to partition s+1;

FIG. 13 is a block diagram of data within sub-slots being routed across sub-slot boundaries or concatenated with other register sub-slots to provide intra slot routing;

FIG. 14 illustrates expansion and truncation of bits during respective load and store operations;

FIGS. 15, 16, 17 and 18 illustrate load and store operations which occur, according to various embodiments, between a vector register and memory; and

FIG. 19 illustrates saturating arithmetic performed upon fixed point, signed values according to the present invention.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Referring to FIG. 4, a flowgraph of a 1x8 IDCT algorithm is shown. The algorithm employs multiple add operations 70, subtract operations 72 and multiply operations 74, all of which are performed in no more than six instruction cycles labeled 76-86. According to the embodiment shown, each instruction cycle can perform up to eight operations. An additional accumulate/merge operation is denoted by dotted lines 88. Thus, add operation 70a adds the contents resulting from add operation 70 to the output from multiply operation 74a to produce an accumulated output. Some of the add and subtract operations 70 and 72 are indicated as no operations ("no ops") whenever an input to that operation is an immediate 0 value. Likewise, multiply operations 74 can be designated no ops depending upon an immediate 1 input value.

FIG. 4 illustrates sixteen multiply operations 74 and thirty two add/subtract operations 70 and 72. Four of the add/subtract operations are no ops, and two of the add operations are accumulate operations, leaving 26 add/subtract operations in accordance with the operations allocated for a 1x8 IDCT computation. See, e.g., Bhaskaran, et al., "Image Compressions Standards and Architectures", ACM Multimedia 94, October, 1994. The IDCT computations shown in FIG. 4 are performed on 16-bit values, and include inverse quantization. The MEU includes an ORU which routes operands in accordance with branches 89, shown in FIG. 4. The operands are routed from one vector to another vector (i.e., from slot to slot) so that they are optimally aligned for the operations performed on them. Each instruction cycle is shown to perform eight independent operations on eight separate operands. Two dissimilar types of operations can occur during each cycle. For example, both add and subtract operations can be performed in a given cycle on dissimilar operands. Thus, if IDCT requires more than one operation type be performed during each instruction cycle, two operation types are available to enhance IDCT throughput.

Routing operands upon a partitioned ALU helps increase algorithm throughput. The mechanism for routing operands, however, is designed outside of, and is merely an extension of, a standard processor or CPU. Thus, routing of operands is performed by an MEU external to a CPU core or, if need be, external to the entire CPU monolithic circuit. FIG. 5 illustrates the latter instance in which an MEU 90 is linked via a coprocessor bus to a microprocessor 92. MEU 90 functions as a coprocessor floating point unit, but contains other features unique to multimedia operations. Microprocessor 92 includes any integer-based microprocessor, a suitable microprocessor being one designed in accordance with the x86 microprocessor architecture developed by Intel Corp. MEU 90 and microprocessor 92 thereby form a computer system 94 having both hardware and software components. To receive external input and/or to operate upon a stored sequence of instructions, computer 94 includes units peripheral to a CPU bus such as an interrupt controller 96, a bus bridge 98, and a plurality of input/output devices 102a-n. A CPU bus, often referred to as the system bus, couples microprocessor 92 to controller 96 and bus bridge 98, as well as main memory 104. I/O devices 102a-102n are coupled to controller 96 and bus bridge 98 via the I/O bus.

I/O devices 102 typically require longer bus clock cycles than microprocessor 92 and other devices coupled to the CPU bus. Thus, bus bridge 98 includes any device which can provide a buffer between the CPU bus and the I/O bus. Additionally, bus bridge 98 translates transactions from one bus protocol to another. A popular I/O bus includes the EISA or PCI bus. I/O devices 102 involve any device which can interface between computer 94 and other devices external to the computer, and include a modem, a serial or parallel port, etc. Main memory 104 includes at least one RAM array of cells and a RAM controller.

Generally speaking, microprocessor 92 executes sequences of instructions ("programs") stored in main memory 104 and operates upon data stored in main memory 104. Concurrently, MEU 90 also operates upon instructions within main memory 104. The instructions unique to MEU 90 are deemed vector instructions useful in performing, for example, data compression or transformation operations, such as the IDCT shown in FIG. 4.

When embodied upon a separate monolithic substrate, MEU 90 communicates to processor 92 via a coprocessor bus. As will be described, MEU 90 is scalable in its operation, and can perform any algorithmic or Boolean combination useful in data compression, correlation, convolution, FIR, IIR, transforms (FFT or DCT/IDCT), and/or matrix computations on a received signal. According to a preferred embodiment, the signal is an image (either still or full motion), whereby MEU 90 can perform fast matrix computation on picture elements within macro blocks of select image frames.

FIG. 6 illustrates an embodiment in which the MEU is formed as part of a microprocessor, preferably on the same monolithic substrate. The integer and non-integer (i.e., MEU vector) elements of the microprocessor are designed to execute instructions concurrently with one another. According to one embodiment, a processor 110a which includes both integer core and MEU features is shown in FIG. 6. Microprocessor 110a includes an instruction cache 112 coupled to a decode unit 114 which is in turn coupled to execution units (or arithmetic logic units) unique to integer
and vector operations. Vector ALU 116 is shown as part of MEU 90, and integer ALU 118 is shown as part of the integer core. Microprocessor 110a also includes a data cache 120 coupled between the integer ALU 118 and a bus interface unit 122. Of course, there can be numerous other functional blocks associated with microprocessor 110a such as, for example, register files and writeback stages associated with the integer ALU 118.

Instruction cache 112 is a high speed cache memory capable of storing and retrieving instruction code. It is noted that instruction cache 112 may be configured as a set-associative or direct-mapped cache. Instructions fetched from instruction cache 112 are transferred to decode unit 114 which decodes the instructions to determine the operands used by the instruction as well as to bit-encode the instruction for the execution units of vector ALU 116 and integer ALU 118. Decode unit 114 fetches register operands from register files (either vector registers or integer registers). Within MEU 90, ORU 124 re-aligns the operands within one source register prior to their entry into vector ALU 116. In this manner, vector ALU 116 receives register operands during the same clock cycle that it receives instructions.

In addition to fetching register operands, decode unit 114 routes each instruction to integer ALU 118 or vector ALU 116 based on the type of instruction encountered. Vector instructions are routed to vector ALU 116, while integer instructions are routed to integer ALU 118. Integer ALU 118 may include an execute stage and a writeback stage. The execute stage executes the instructions provided by decode unit 114, producing the result. Integer ALU 118 often utilizes a memory operand, wherein the memory operand is transferred from data cache 120 prior to execution of the instruction. The writeback stage stores the result generated by an execute stage into a destination register specified by the instruction. The destination of an MEU operation (i.e., vector operation) is generally a destination register within vector registers 128. Vector store instructions have a destination in main memory 104 (shown in FIG. 5), a copy of which may be stored in data cache 120. Similarly, vector load operations have a source operand in main memory 104, a copy of which may be stored in data cache 120.

Vector ALU 116 responds to a decoded instruction code, and the vector ALU result is written to a destination specified by the vector instruction. More particularly, decode unit 114 provides control signals regarding operand routing to ORU 124 and control signals regarding instruction operation to vector ALU 116. These control signals are generated according to the vector instruction fetched from instruction cache 112. An exemplary vector instruction encoding is provided hereinbelow.

Integer registers 126 are configured to store register operands for use by integer ALU 118. In one embodiment, registers 126 store the x86 register set which includes the EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP registers. Additionally, integer registers 126 may store the segment registers of the x86 architecture, as well as other miscellaneous registers. Conversely, vector registers may comprise the registers within the floating point unit. According to one embodiment, vector registers 128 comprise eight of the 80-bit floating point registers available in an x87 architecture. According to one embodiment, two 80-bit registers are coupled together to form a 160-bit register and a 160-bit data bus connecting vector registers 128 and vector ALU 116. Each slot of each vector register can store a 20-bit operand which comprises a fixed point number. The fixed point number is a non-integer number, wherein the binary point is immediately right of the leftmost bit (the leftmost bit is reserved as a sign bit). Conversely, integer numbers define the binary point immediately right of the rightmost bit. The fixed point values thereby range between -1.0 and +1.0. It is known that integer values typically exceed 1.0 or are less than -1.0 by incremental integer (non-fractional) amounts.

Bus interface unit 122 is configured to effect communication between microprocessor 110a and devices coupled to the CPU bus. For example, instruction fetches which miss instruction cache 112 may be transferred from main memory attached to the CPU bus by bus interface unit 122. Similarly, memory operations which miss data cache 120 may be transferred from main memory by bus interface unit 122. Additionally, data cache 120 may discard a cache line of data which has been modified by microprocessor 110a. Bus interface unit 122 transfers the modified line to the main memory.

Turning now to FIG. 7, microprocessor 110b is shown according to an alternative embodiment. Similar to microprocessor 110a, microprocessor 110b includes a bus interface unit 122, an instruction cache 112, a data cache 120, and integer registers 126. Bus interface unit 122 is coupled to instruction cache 112 via an instruction transfer bus. Similarly, bus interface unit 122 is coupled to data cache 120 via a data transfer bus. Additionally, microprocessor 110b includes a multiple instruction decode unit 130 coupled between instruction cache 112 and a plurality of execution units 132a-132n. A load/store unit 134 is included to interface between execution units 132 and data cache 120.

Microprocessor 110b includes a reorder buffer 131 coupled to decode unit 130, execution unit 132 and load/store unit 134. The reorder buffer 131 allows concurrent execution of multiple integer instructions carried forth in what is generally termed a "superscalar" architecture. Decode unit 130 therefore concurrently decodes multiple instructions and dispatches the instructions to the appropriate execution unit 132a-132n. Additionally, decode unit 130 dispatches vector instructions to MEU 90 concurrent with the integer instructions. A storage location within reorder buffer 131 is allocated for each decoded and dispatched instruction. The storage locations are allocated to instructions in the order in which they occur within a task, so that the results created by executing instructions may be stored into register file 126 or data cache 120 in program order. By including reorder buffer 131, instructions may be speculatively executed out of order by execution units 132. Thus, in one embodiment, MEU 90 is designed to operate concurrently with multiple issued instructions, speculatively executed out of order by multiple execution units 132. The superscalar architecture shown in FIG. 7, as it applies to speculative execution, is well known; however, the addition of an MEU execution unit presents additional advantages unique to both high speed conventional processors and DSPs.

Turning now to FIG. 8, data paths connecting various components of MEU 90 are shown according to one embodiment. Advantages illustrated by features of FIG. 8 include, for example, a structure by which vector instructions can operate upon 160-bits of data at a time, operand slot and sub-slot segregation of the 160-bit wide registers, saturating arithmetic performed on operands of fixed-point values, support for data scaling from 8/10 bit to 16/20 bit values, multimedia-type vector instructions and their impact on arithmetic operations, operand routing and operand loading/storing. In a full implementation, microprocessor 110a or 110b can perform up to sixteen 10-bit arithmetic operations per instruction cycle. In less than a full implementation, the registers can be 80-bits wide or 40-bits
wide rather than 160-bits wide to save silicon space at the expense of performance. In the latter instance, eight or four 10-bit arithmetic operations, respectively, can occur per clock cycle. In a full implementation which performs sixteen operations per cycle, a microprocessor operating at 150 MHz can give a peak performance of 2.4 billion calculations per second. The concept of using partitions of various sizes (i.e., 10-bit or 20-bit operands) or a variable number of registers, ORU multiplexers and vector ALU logic portions is referred to as scalability, an important advantage of the present invention.

In order to minimize the hardware and software impact upon existing processor or coprocessor cores, MEU 90 may use the pre-existing registers of many existing floating point units. Vector instructions treat the register slots as containing small fixed-point data values rather than large floating-point numbers. Since operating systems save the entire state of the floating point unit as necessary during context switches, the operating system does not need to be aware of the new functionality. It is important to note that the MEU and floating point unit do not necessarily need to share vector ALU logic or vector registers. The microprocessor could simply have a mechanism that maintains coherency between the vector register values in completely separate MEU and floating point unit sections.

There are several advantages in reusing the floating point unit registers as vector registers 128. First, floating point unit register files can hold almost three times as much data as integer registers, and can be used concurrently with the integer registers. Second, the MEU implementation does not impact or change the integer registers or load/store units. Thus, the MEU can be optionally included or excluded from almost any conventional microprocessor as merely an "extension" to the processor core. An advantage of applying the MEU to existing processors is for reasons of scalability or modularity. MEU performance can be readily changed to fit the application without changing the processor whatsoever. Third, MEU instructions issued on the floating point unit register occur concurrently with integer instructions issued upon the integer registers so as to obtain maximum utilization of all microprocessor logic. MEU 90 is used to perform large numbers of parallel computations while the integer units (integer ALU and integer registers in scalar form or integer ALU, integer registers, load/store, reorder buffer, etc. in superscalar form) perform addressing calculations and program flow control. Parallel computations in

of 20-bit operands within slots of a source B 160-bit register. Operands within slots of source B register are re-ordered or aligned with operands within slots of source A register or destination register, each of which is 160 bits in length and contains 20-bit slots. For example, if bits within the first slot of source B register are to be added with the bits within the second slot of source A register, ORU 124 reorders the operand bits in slot 1 to slot 2 of source B register consistent with slot 2 of the source A register. Vector ALU 116 then combines slot 1 of source A register with slot 1 of source B register, slot 2 of source A register with slot 2 of source B register, and so forth. A single vector instruction controls routing of each partition or vector applied to ALU 116. ORU 124 operates on operands within source B register (either v0, v1, v2 or v3) as the operands are fed into vector ALU 116. ALU 116 and ORU 124, in combination, allow a microprocessor to execute an algorithm in a fashion that directly follows the algorithm's flowgraph representation. Extraneous move, load and shift operations are therefore substantially minimized. At each level in the flowgraph, such as the flowgraph of FIG. 4, ALU 116 operates on the nodes within the flowgraph and ORU 124 implements the diagonal interconnections. This feature provides high performance and makes MEU 90 easier to program at the assembly-language level, since the instructions map directly onto an algorithm's flowgraph representation.

In a lower performance implementation, MEU 90 can be formed with 80-bit or 40-bit datapaths. Instead of performing eight computations within a single cycle, the lower performance implementation can operate on pairs of four or two operands. Thus, two or four clocks may be needed for each vector instruction, given the lower performance implementation. The scalability benefit of variable performance allows use on many types of algorithms. For example, there may be algorithms in which fewer than eight operations are needed in a single cycle, in which case 80-bit or 40-bit registers would serve that application. At the highest end, as many operations as possible (i.e., eight) are performed in a single clock cycle. There also may be instances in which multiple MEU units 90 might be considered to further enhance performance. A lower performance implementation requires a lesser amount of added die size since fewer multiplexers are needed in ORU 124, and fewer partition logic elements are needed in vector ALU 116. The performance vs. die size tradeoff can therefore be adjusted to suit the intended application for any particular scalar or super-

the MEU occurs simultaneous with addressing and program scalar microprocessor. There is considerable risk involved in 

flow control without hindering the normal operations of doing enhancements that provide a fixed performance 

microprocessor 110a or 110b. Fourth, the MEU does not increase relative to the core microprocessor. As data com- 

define any new microprocessor states, control or condition 50 pression formats change, and demands upon DSPs change, 

code bits other than a global MEU extension enable bit. present performance -enhancing architectures will not be 

In high performance implementations, all eight of the sufficient unless they are scale able— either in terms of 

80-bit floating point registers are utilized, and the registers hardware scale or partition data size. By adding a flexible 

are accessed in pairs. This effectively creates four 160-bit performance MEU 90 to a low end processor core, a cost 

vector registers 116, denoted in FIG. 8 as vO through v3. Bit 55 competitive, and relatively inexpensive DSP is accom- 

coding is reserved in the MEU instruction format for future plished within the existing processor framework. The DSP/ 

expansion to possibly eight 160-bit vector registers. The processor advantageously uses the same enhanced x86 

extra registers would be used to implement future perfor- instruction set that was developed for the host processor, 

mance enhancements such as software pipelining. thus greatly reducing software development cost. 

Each of the 160-bit registers are partitioned according to 60 Vector ALU 116 can support a three-operand instruction 

a preferred embodiment into eight 20-bit slots or sixteen format. Operations such as addition, subtraction, multiply 

10-bit sub-slots. In order to operate upon a slot or a sub-slot, and shift utilize operands from source A and source B as the 

a partitioned ALU is necessary. The partitioned ALU, input to ALU 116. However, other operations, such as 

denoted as vector ALU 116, is divided into separate logic multiply-and-accumulate combine operands within a source 

units which perform discrete operations. In order to route 65 register with operands within the destination register, 

slots within a rather large 160-bit register, a slot router or wherein the destination register is the implied third operand 

ORU 124 is necessary. ORU 124 serves to change the order to ALU 116. The written result can be immediately stored 
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and used for subsequent operations without involving FIG. 9 indicates selection of slot s from slots 0 through 7 

unnecessary move instructions between registers or stores to of the source B register. One of the logic portions 117a-U7/t 

memory. of ALU 116 combines the selected slot with another slot s 

ALU 116 supports heterogeneous operations on the par- ^thin source A register. The result of that combination is 
titioned registers 128. According to one embodiment, two 5 presented to slot s of the destination register. Each input to 
types of operations can be performed in a single insuiicUon. ^ ^ m Qr eacfa dol m ^ destination ^ can 
These operations can be assigned to each operand within . j r , - , . , . 
each slof(or sub-slot) of each source register For example, independently receive one of eleven values: a value in one 
four additions and four subtractions can be performed in a of mc CI S ht source slots * immediate °> immediate 1.0 or 
single cycle upon eight pairs of operands. By having capa- 1Q immediate -1.0. The opcode mnemonic uses a character to 
bility of two operation types, it is easier to map algorithms represent each choice. Thus, given the above order, the 
containing numerous dissimilar operation types onto each mnemonic is represented as 0 12345 67ZPN. Each ORU 
instruction cycle. Thus, if two adds and six multiplies are mnemonic uses eight of these characters to represent the 
needed, followed by four adds and four multiplies, the routing operation. The following code illustrates a simple 
heterogeneous operation scheme hereof can perform these copy ope ration followed by an operation that would inter- 
operations within two cycles, rather than having to separate kave the low ha]f sub . slot of one re ^ stef ^ ^mring 1 
the operations mto four cycles. and _ 1 values . 

FIG. 9 illustrates in more detail ORU 124 and ALU 116. 
It would be difficult for software alone to take advantage of 

the raw number of micro instructions per second offered by 

ALU 116 without a means to flexibly move operands within 20 ;c0 py vi to vo 

and between large 160-bit registers. Operand routing is more {mov mov mov mov mov mov mov mov} word vO, vi, vi(765432io) 

critical for a vector processor (i.e., a vector ALU employing ; move of vi to v3, performing interleave with v& and -l's 

vector registers) than normal scalar processors which {mov mov mov mov mov mov mov mov} word v3, vi, vi(P3N2PiNO) 
employ a smaller integer ALU and integer registers. Scalar 

processors can use memory addressing to randomly access is 

individual operands; however, a vector processor must load Referring to FIGS. 6-9, slot s is decoded from a vector 

data from memory in larger monolithic bit streams. Without instruction within instruction cache 112 by decode unit 

the ability to flexibly access and route individual operands, 114/130. The decoded instruction is forwarded to ORU 124, 

algorithms often must be structured to perform a single and specifically to the multiplexers 125fl-125*, to select a 

operation on a larger portion of the data before moving on 30 s i 0 t s as shown in FIG. 9. Likewise, the same vector 

to the next operation. This puts a substantial burden on the instruction is decoded and forwarded to vector ALU 116, 

memory load and store bandwidth because the intermediate afld cificall to { ic ^ n 7o -117/i, to select an 

result ; between operations do not all fit m the vector register ion tfae dfi me ^ 

fie. Moreover, me memory reference Pattern for this mode AccQxd[ { m instniction which decodes a slot (or su5 . 

of calculations tend to use stride patterns that are highly , tX c & 7. ( (L ATIT , , , , v 

u**, t-u* *i i j 35 slot) for routing to the ALU and which decodes an operation 

inefficient in cached architecture. The typical workaround . j i . / i_ i . \ • ^ « j 

r 4 . . . , , j . , , . f upon those routed slots (or sub-slots) is defined as a vector 

for this problem would be to perform large numbers of • , , . ™_ 4 V , y r ^ . A 

t ' 4 r iii -*l * mstruction. There are two classes or vector mstructions 

mtra-register moves that consume clock cycles without , c , c t . „ rTT t *• i • . a 

, . n i , , rj, , .1 Am t defined for the MEU: vector operational instructions and 

dome useful calculations. To solve this problem. ORU 124 , , . . , r 

• J . j . . 1 - >IK . t ' ! . f vector load/store mstructions. 

is devised, wherem ORU 124 swizzles bits within slots of , rt 

vector registers as data moves through the ORU. Swizzling Vector operational instructions use a single opcode format 

or realigning the data allows the operands to be shuffled as for simultaneously controlling ALU 116 and ORU 124; this 

needed by the algorithm concurrently with ALU 116 arith- format is approximately 8 bytes long. Each instruction 

metic operations. MEU 90 can thereby load data slots, do a encodes the two source registers, the destination register, the 

variety of operations between data slot elements, and then 45 partition size, and the operations to be performed on each 

store the final result without involving numerous memory partition. In addition, each instruction encodes the ORU 

accesses. Load/store units are therefore less likely to be routing settings for each of the eight slots. According to a 

overloaded, leaving free bandwidth for the x86 integer ALU preferred embodiment, the following represents a vector 

and integer registers to do basic addressing, execute, and operational instruction coding format which occurs after the 

writeback operations. OFh F8h opcode: 



OOOOsOaa ObbOddOx xxxxxOyy yyyypppp ppppAAAA AAABBBBB BBCCCCCC CDDDDDDD 



FIG. 9 indicates ORU 124 as comprising a series of 
multiplexers 125a-125n, which may be thought of as essen- 
tially an 8x8 crossbar switch with some enhancements. Each 
multiplexer of ORU 124 selects one slot s of a plurality of 60 
slots, labeled in FIG. 9 as slot 0 through slot 7. Each slot 
contains either one 20-bit partition or two 10-bit partitions 
(i.e., two sub-slots), depending upon the partition width 
specified in the vector instruction. For 10-bit partitions, the 
MEU 90 simultaneously performs independent but identical 65 
types of operations (i.e., two adds, two subtracts, etc.) on 
sub -slot pairs within each slot. 



where,

0=reserved; must be zero
s=partition size (10 or 20 bits)
aa=sourceA register
bb=sourceB register
dd=destination register
xxxxxx=first operation code
yyyyyy=second operation code
pppppppp=1-bit operation selects for 8 slots (op xxxxxx or yyyyyy)
AAAAAAA to DDDDDDD=router slot selection fields

It is noted that selection of a slot by ORU 124 is coded by the fields described above as AAAAAAA to DDDDDDD. The slot selection field format utilizes 7 bits to represent, for two slots, the eleven possible routing values of source B slots 0-7, immediate 0, immediate +1 and immediate -1. A pair of slots has 11 x 11 = 121 possible routing combinations, which fits within the 128 values of a 7-bit field. Use of 7 bits therefore maximizes the coding density by coding pairs of slots within each field.
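
The coding format above can be exercised with a short Python sketch that packs and unpacks the named fields. This is only an illustration under stated assumptions: the leftmost character of the listing is taken to be the most significant bit of a 64-bit word, and the names `LAYOUT`, `pack` and `unpack` are hypothetical helpers, not part of the MEU definition. The numeric meaning of each field value (for example which value of s selects 10-bit partitions) is not given in the text and is not modeled.

```python
# Field widths read left to right from the 8-byte coding format above.
# Assumption: the leftmost listed bit is the most significant bit.
# Names starting with "rsv" are the reserved (must-be-zero) positions.
LAYOUT = [
    ("rsv0", 4), ("s", 1), ("rsv1", 1), ("aa", 2),
    ("rsv2", 1), ("bb", 2), ("rsv3", 1), ("dd", 2), ("rsv4", 1),
    ("x", 6), ("rsv5", 1), ("y", 6), ("p", 8),
    ("A", 7), ("B", 7), ("C", 7), ("D", 7),
]
assert sum(width for _, width in LAYOUT) == 64


def pack(fields):
    """Build the 64-bit instruction word from a dict of field values."""
    word = 0
    for name, width in LAYOUT:
        # Reserved positions are always written as zero.
        value = 0 if name.startswith("rsv") else fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"field {name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a 64-bit instruction word back into its named fields."""
    fields = {}
    remaining = 64
    for name, width in LAYOUT:
        remaining -= width
        value = (word >> remaining) & ((1 << width) - 1)
        if name.startswith("rsv"):
            if value != 0:
                raise ValueError("reserved bit is set")
            continue
        fields[name] = value
    return fields
```

Packing and then unpacking a dictionary of field values returns the same dictionary, which is a quick way to check that the field widths add up to the 8 bytes stated above.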

Slot selection fields are best described in reference to an example. An exemplary encoding can be presented to ORU 124 for each slot in order to select one of the eight source B slots, or one of the immediate values as follows:

0000->slot 0 of source B
0001->slot 1 of source B
0010->slot 2 of source B
0011->slot 3 of source B
0100->slot 4 of source B
0101->slot 5 of source B
0110->slot 6 of source B
0111->slot 7 of source B
1000->+1.0
1001->-1.0
1010->0

Coding each destination slot source B operand independently would thereby require 4*8=32 bits. However, since there are only eleven possibilities per destination slot, two destination slots' encodings may be combined into a field AAAAAAA to DDDDDDD. The fields are used to generate encodings for select pairs of slots using one 7-bit field for each pair as follows:

Field A=destination slots 0 & 1
Field B=destination slots 2 & 3
Field C=destination slots 4 & 5
Field D=destination slots 6 & 7

For example, consider field A. Field A has 7-bits numbered 0 through 6. Examination of bit A(6) reveals that if it is set to a 1, then both slot 0 and slot 1 use source B register slots as follows:

slot 0 encoding=0A(5)A(4)A(3)
slot 1 encoding=0A(2)A(1)A(0)

If A(6) equals 0, then examine A(5) and A(4), such that if A(5) equals 1 and A(4) equals 1, then each of the slot 0 and 1 use one of the immediate values as follows:

slot 0 encoding=10A(3)A(2)
slot 1 encoding=10A(1)A(0)

If A(5) does not equal 1 or A(4) does not equal 1, then one of the slots uses a source B register slot and the other uses one of the immediate values. A(3) is used to determine which of the slots is which. If A(3) equals 1, then slot 0 uses a source B register and slot 1 uses an immediate value as follows:

slot 0 encoding=0A(2)A(1)A(0)
slot 1 encoding=10A(5)A(4)

If A(3) equals 0, then slot 0 uses an immediate value and slot 1 uses one of the source B registers as follows:

slot 0 encoding=10A(5)A(4)
slot 1 encoding=0A(2)A(1)A(0)

An example of how various codings of field A, given the above exemplary explanation, would route various slots of source B register to slot 0 and 1 is as follows:

Field A coding     slot 0               slot 1
6 5 4 3 2 1 0      3 2 1 0              3 2 1 0

1 0 1 0 1 1 0      0 0 1 0 (slot 2)     0 1 1 0 (slot 6)
0 1 1 0 0 1 0      1 0 0 0 (+1.0)       1 0 1 0 (0)
0 1 0 1 0 0 0      0 0 0 0 (slot 0)     1 0 1 0 (0)
0 0 1 0 1 0 0      1 0 0 1 (-1.0)       0 1 0 0 (slot 4)
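
The decoding rules for a 7-bit slot selection field can be sketched in Python as follows. The function name `decode_pair_field` and the `ROUTE_NAMES` table are hypothetical and introduced only for illustration; the branch structure follows the A(6), A(5)/A(4) and A(3) tests described in the text, and the 4-bit results use the per-slot encodings 0000 through 1010 listed there.

```python
# Meaning of each 4-bit per-slot encoding, as listed in the text.
ROUTE_NAMES = {
    0b0000: "slot 0", 0b0001: "slot 1", 0b0010: "slot 2", 0b0011: "slot 3",
    0b0100: "slot 4", 0b0101: "slot 5", 0b0110: "slot 6", 0b0111: "slot 7",
    0b1000: "+1.0", 0b1001: "-1.0", 0b1010: "0",
}


def decode_pair_field(field):
    """Split a 7-bit field into the 4-bit encodings of its two slots.

    Returns (first, second); for field A these are destination slots 0 and 1.
    """
    if not 0 <= field < 128:
        raise ValueError("slot selection field must fit in 7 bits")
    a6 = (field >> 6) & 1
    a5 = (field >> 5) & 1
    a4 = (field >> 4) & 1
    a3 = (field >> 3) & 1
    low3 = field & 0b111                      # A(2)A(1)A(0)
    if a6:
        # Both slots take source B slots.
        first = (field >> 3) & 0b111          # 0 A(5) A(4) A(3)
        second = low3                         # 0 A(2) A(1) A(0)
    elif a5 and a4:
        # Both slots take immediate values.
        first = 0b1000 | ((field >> 2) & 0b11)   # 1 0 A(3) A(2)
        second = 0b1000 | (field & 0b11)         # 1 0 A(1) A(0)
    elif a3:
        # First slot takes a source B slot, second an immediate.
        first = low3                             # 0 A(2) A(1) A(0)
        second = 0b1000 | ((field >> 4) & 0b11)  # 1 0 A(5) A(4)
    else:
        # First slot takes an immediate, second a source B slot.
        first = 0b1000 | ((field >> 4) & 0b11)   # 1 0 A(5) A(4)
        second = low3                            # 0 A(2) A(1) A(0)
    return first, second
```

For instance, `decode_pair_field(0b1010110)` yields the encodings for source B slot 2 and slot 6, matching the first row of the field A example. Some bit patterns decode to 1011, which the text does not define; the sketch returns such values unchanged rather than guessing their meaning.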



Fields B through D similarly encode source B operand selection for destination slots 2-7. Using an example to help explain a vector operational instruction, the mnemonics used to specify the operations performed on each slot, the source and destination registers and ORU routing for an exemplary two-operation type add/subtract instruction are as follows:

{sbr sbr add add sbr add sbr add} word v3, v2, v1(37P3Z1N2)

Routing is performed on source B slots and immediate values in accordance with an order 37P3Z1N2 to respective destination register slots 76543210. Thus, v3 is denoted as the destination register, v2 is the source A register, and v1 is the source B register. Slots for the operand specifier and the routing specifier are laid out in decreasing order from left to right, wherein operands in each of slots 7 and 6 receive a reverse subtract (sbr) operation, operands in slot 5 receive an add operation, etc. The "word" symbol specifies that the instruction is performed on a 20-bit slot as opposed to a 10-bit sub-slot. A word is represented as two bytes, wherein each byte within memory is represented as 8 bits. When a byte (or two-byte word) is loaded into the registers, the byte or word is expanded to a 10-bit sub-slot or 20-bit slot, respectively. The routing specifier for source B using the example set forth above is as follows:

dest.7<==-sourceA(s=7)+sourceB(s=3)

dest.6<==-sourceA(s=6)+sourceB(s=7)

dest.5<==sourceA(s=5)+#1.0

dest.4<==sourceA(s=4)+sourceB(s=3)

dest.3<==-sourceA(s=3)+#0.0

dest.2<==sourceA(s=2)+sourceB(s=1)

dest.1<==-sourceA(s=1)+#-1.0

dest.0<==sourceA(s=0)+sourceB(s=2)
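
The effect of the example instruction can be checked with a small Python model. This is a sketch under simplifying assumptions: each slot is held as an ordinary Python float rather than a 20-bit fixed-point value, saturation is ignored, and the helper names `route` and `execute` are hypothetical. Lists are indexed by slot number, so index 0 is slot 0, while the operation and routing strings are written for slots 7 down to 0 as in the mnemonic.

```python
# Immediate values selectable by the routing characters Z, P and N.
IMMEDIATES = {"Z": 0.0, "P": 1.0, "N": -1.0}

# A few two-operand operations: a is the source A slot, b the routed source B value.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "sbr": lambda a, b: b - a,   # reverse subtract
    "mov": lambda a, b: b,
}


def route(src_b, spec):
    """Apply an 8-character routing specifier (slots 7..0) to source B."""
    if len(spec) != 8:
        raise ValueError("routing specifier needs eight characters")
    routed = [0.0] * 8
    for pos, ch in enumerate(spec):
        slot = 7 - pos                      # leftmost character is slot 7
        routed[slot] = IMMEDIATES[ch] if ch in IMMEDIATES else src_b[int(ch)]
    return routed


def execute(ops, src_a, src_b, spec):
    """Model one vector operational instruction on eight slots."""
    names = ops.split()
    if len(names) != 8:
        raise ValueError("one operation per slot is required")
    routed = route(src_b, spec)
    dest = [0.0] * 8
    for pos, name in enumerate(names):
        slot = 7 - pos
        dest[slot] = OPS[name](src_a[slot], routed[slot])
    return dest


v2 = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]   # source A, slot 0 first
v1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]          # source B, slot 0 first
v3 = execute("sbr sbr add add sbr add sbr add", v2, v1, "37P3Z1N2")
```

With these inputs, slot 7 of `v3` is source B slot 3 minus source A slot 7, and slot 5 is source A slot 5 plus the immediate 1.0, in line with the dest.7 and dest.5 expressions above.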

A vector instruction can specify any two of various vector operations. Thus, each slot can be arbitrarily assigned either of the two types of operations. For example, operands in slots 0 through 3 could receive one operation type while operands in slots 4 through 7 receive another. There are numerous advantages in being able to apply two different operations within a single vector instruction, one of which is to enhance the flexibility by which operands are routed and operations performed. The following Table I defines each type of operation that can be used in a vector instruction:

TABLE I

Vector Operation Descriptions

CATEGORY            MNEMONIC            DESCRIPTION

Add                 add add_            Add sourceA and sourceB partitions, place sum in
                                        destination. add_ arithmetically shifts the result
                                        right by one bit (computes average).

Subtract            sub sub_            Subtract partitions. sub does sourceA - sourceB;
                    sbr sbr_            sbr does sourceB - sourceA. sub_ and sbr_
                                        arithmetically shift the result right by one bit.

Accumulate/         acum acum_          Add the contents of the destination register
Merge                                   partition to the sourceB partition and place the
                                        sum in the destination. acum_ arithmetically
                                        shifts the result right by one bit.

Negate              neg                 Negate sourceB partition and place in destination.

Distance            dist                Subtract partitions then perform absolute value.

Multiply            mul                 mul multiplies the sourceA partition by the
                    mac                 sourceB partition and places the product in the
                                        destination. mac multiplies sourceA by sourceB and
                                        adds the product to the destination.

Conditional         mvz mvnz            Conditionally move partition in sourceB register
Move                mvgez mvlz          to partition in destination register depending on
                                        sourceA partition's relationship to zero.

Scale               asr                 Arithmetically shifts the operand in sourceB by
                    asl                 amount n. n can be between 1 and 4 inclusive. asl
                                        uses saturating arithmetic and shifts zeros in
                                        from the right. asr copies the sign bit from the
                                        left.

Logical             lsr                 Logically shifts the operand in sourceB by amount
Shift               lsl                 n. n can be between 1 and 4 inclusive. Zeros are
                                        shifted in from the left or right. lsl uses modulo
                                        arithmetic; it does not clip.

Boolean             false nor bnota     Perform one of sixteen possible Boolean operations
                    nota anotb notb     between sourceA and sourceB partitions. (The
                    xor nand and        operations are listed in order of their canonical
                    nxor b borna        truth table representations.)
                    a aornb or
                    true

Round               rnd                 Add the constant (1*LSb << n-1) to sourceB, then
                                        zero out the n lowest bits. n can be between 1 and
                                        4 inclusive. Implements "round-to-even" method: if
                                        (sourceB<n:0> == 010...0), then don't do the add.

SourceA             pshra               For each slot s, copy the contents of slot s + 1
Partition                               from the sourceA register to slot s in the
Shift                                   destination register. (If this operation is used
                                        in slot 7, then the result is immediate zero.)
                                        This operation can be used to efficiently shift
                                        data inputs and outputs during convolutions (FIR
                                        filters, etc.).

Slot                blbh                These operations are defined only for 20-bit
Routing             ahbh                partitions. They are used to route 10-bit data
                    albl                across the even/odd "boundary" that the ORU
                                        doesn't cross. blbh swaps the upper and lower
                                        halves of the sourceB operand and places the
                                        result in the destination. ahbh concatenates the
                                        upper half of the sourceA with the upper half of
                                        sourceB. albl concatenates the lower half of the
                                        sourceA with the lower half of sourceB.

Store               ws2u                This operation is used prior to storing 16-bit
Conversion                              unsigned data from a 20-bit partition. If bit 19
                                        of sourceB is set, the destination is set to zero.
                                        Otherwise, this operation is the same as lsl 1.

Extended-           emach               These operations are used to perform
Precision           emacl               multiply-and-accumulate functions while retaining
                    emaci               36 bits of precision in intermediate results; they
                    carry               are only defined for 20-bit partitions. emach is
                                        the same as mac, except that no rounding is done
                                        on the LSb. emacl multiplies sourceA and sourceB,
                                        then adds bits <18:3> of the 39-bit intermediate
                                        product to bits <15:0> of the destination,
                                        propagating carries through bit 19 of the
                                        destination. emaci is similar to emacl, except
                                        that bits <19:16> of the destination are cleared
                                        prior to the summation. The carry operation
                                        logically shifts sourceB right by 16 bits, then
                                        adds the result to sourceA.

There are several common operations which are desirable for vector ALU 116 to perform, and which comprise the specific methods of using the operations defined in Table I. The specific uses are commonly referred to as aliases.

By way of example, a common desirable operation allows computation of the average of two given operands. The vector ALU 116 does not explicitly provide an average operation. However, vector ALU 116 implicitly provides an operation to take the average of two operands in that an add_ instruction does compute the average of two operands. This is due to the fact that the average of two numbers equals the summation of two numbers divided by two. Since the add_ instruction shifts the sum of the two operands right by one bit, it in effect performs the average operation useful as an alias. The following Table II illustrates some common operations that are aliases of those shown in Table I:



TABLE II

Operation Synonyms

CATEGORY            ALIAS NAME     ACTUAL OPERATION    DESCRIPTION

Move SourceB        mov            b                   Move the sourceB register partition to the
                    mov_           asr 1               destination partition. mov_ arithmetically
                                                       shifts the result right by one bit.

Move SourceA                       a                   Copy the partition in sourceA to the
                                                       destination.

SourceA                            dist (...)          Compute the absolute value of the sourceA
Absolute Value                                         partition.

Unmodified          dest           acum (...)          Leave the destination partition unchanged.
Destination

Average             avg            add_                Compute average of two values.
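
A few of the operations and aliases in Tables I and II can be modeled directly on raw partition values. The following Python sketch is illustrative only: the function names mirror the mnemonics, but the helpers `halves` and `join` are hypothetical, partition values are plain Python integers, and range limits and saturation are not modeled. For ahbh and albl it is assumed that the half named first lands in the upper half of the destination slot, consistent with the naming rule given for blbh in the discussion of FIG. 13.

```python
HALF_BITS = 10                      # a 20-bit slot holds two 10-bit sub-slots
HALF_MASK = (1 << HALF_BITS) - 1


def add_(a, b):
    """add_: sum two signed partitions, then arithmetic shift right by one bit."""
    return (a + b) >> 1             # Python's >> on integers is an arithmetic shift


avg = add_                          # Table II lists avg as an alias of add_


def halves(slot):
    """Return (upper, lower) 10-bit halves of a 20-bit slot value."""
    return (slot >> HALF_BITS) & HALF_MASK, slot & HALF_MASK


def join(upper, lower):
    """Build a 20-bit slot value from two 10-bit halves."""
    return ((upper & HALF_MASK) << HALF_BITS) | (lower & HALF_MASK)


def blbh(a, b):
    """Swap the upper and lower halves of the sourceB slot."""
    upper, lower = halves(b)
    return join(lower, upper)


def ahbh(a, b):
    """Concatenate the upper half of sourceA with the upper half of sourceB."""
    return join(halves(a)[0], halves(b)[0])


def albl(a, b):
    """Concatenate the lower half of sourceA with the lower half of sourceB."""
    return join(halves(a)[1], halves(b)[1])
```

Because the shift discards the lowest bit of the sum, `avg` rounds toward negative infinity when the sum is odd; whether the hardware rounds differently is not stated in the text.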
The various types of operations outlined in Tables I and II pose many advantages for DSP-type operations, and prove efficient in performing, for example, repetitive and sequential adds, subtracts, multiplies and shifts (moves). These types of operations are uniquely pertinent to, for example, operand routing, data compression, correlation, convolution and transformation operations. Normally, any two of the vector operations defined in the preceding Tables I and II may be specified in a single vector instruction. Each slot can be arbitrarily assigned either of the two operations, and the two sub-slots that share each slot always share the same operation. There is one case, however, where possibly four operations can be selected in one instruction. In this case, the four operations are predefined to be add, subtract, reverse subtract, and move. This special case is included because these four operations are typically found in individual stages of flowgraphs used for most DSP algorithms. Use of four distinct and differing operation types, or two differing operation types, reduces the number of instructions needed to perform an algorithm.

FIG. 10 illustrates in further detail a conditional move vector instruction designated in Table I as mvz, mvnz, mvgez, and mvlz. Depending upon the value within slot s of source A register, a move of the operand within slot s of source B register may or may not occur. The relationship of the value to zero determines whether or not the move will occur. A mvz instruction causes an operand from source B register to move to the destination register if the value of an operand within slot s of the source A register is equal to zero. Alternatively, a mvnz instruction provides a move if the value of the operand within slot s of source A register is not equal to zero. Further, a mvgez instruction allows movement of an operand if the value of the operand within slot s of source A register is greater than or equal to zero. Yet further, a mvlz instruction causes movement if the value of the operand within slot s of source A register is less than zero. The fixed point value within slot s of source A register therefore dictates the move. Since the value ranges from -1.0 to +1.0, comparisons and move operations can be readily performed. Thus, this operation is particularly useful as a comparator function. Comparisons are often needed as part of a logical shifting operation. Since there are often numerous shift operations which occur as part of a pixel format conversion, conditional moves play a substantial part in unpacking texture data from memory for use in calculation by the MEU. Probably the most significant advantage of conditional moves is the elimination of branch instructions. Modern CPUs must try to correctly predict branches (such as the branch instruction) to prevent stalls in the execution pipeline caused by re-executing speculative instructions. Any algorithm that makes decisions based on the values of input data (which usually are not very predictable) should try to do the decision without branching. By altering program behavior without branching, conditional moves therefore prove useful.

FIG. 11 illustrates in further detail an accumulate/merge instruction designated in Table I as acum and acum_. An accumulate operation allows the vector ALU 116 to treat an operand within the destination register as a third source operand. Thus, operands within a destination register not only receive accumulated values, but are also forwarded as source values into ALU 116. Using the destination register as an implied third source register thereby achieves an accumulate operation as part of an arithmetic operation. For example, FIR filters often accumulate products or sums to form cumulative totals. Accumulation is particularly useful during the latter stages of an algorithm when intermediate results are being merged into the final result. The amount of data in an algorithm tends to be larger in the middle (interim) of the operation than at the beginning or end. Thus, multiple vector registers tend to get used in the middle portions of the calculation. The acum instruction allows the final operation on an intermediate result register to be done concurrently with the merging into the final result. Without acum, this must be done with a move instruction that requires source A to be different from source B. It is less likely that useful arithmetic operations can be performed in parallel with moves.

The acum instruction is described as an operation which adds the value or contents within one slot of a destination register to the contents or value within another slot of a source B register. The combined results are then placed back into the same slot of the destination register from which they are fed. The instruction acum_ serves to arithmetically shift the combined result right by one bit. Shifting the result by one bit thereby computes the average value of that result. Using fixed point arithmetic, instead of the value being, e.g., 0.5, acum_ causes the binary value to shift to the right by one bit thereby forcing a 0.25.

FIG. 12 illustrates in further detail the pshra instruction shown in Table I. The pshra instruction is used to shift (or copy) the contents within slot s+1 of source A register to slot s of the destination register. If pshra is used to shift contents within slot 7 of the source A register, then an immediate 0 value will be shifted to slot 7 of the destination register. The pshra operation is particularly useful as a right shift routine from slot-to-slot. More importantly, pshra moves data to an adjacent slot without involving the ORU. A single operational vector pshra instruction can concurrently move data across all the slots or select slots, depending upon the amount of movement required. This allows movement of data from one slot to an adjacent slot while simultaneously moving new data into the vacated slot. The pshra instruction does not involve the ORU. Instead, movement between registers occurs exclusively within the vector ALU. This is useful for certain serial operations found in, for example, FIR filter algorithms. Exemplary code for an inner loop of eight taps of an FIR filter algorithm involving the pshra instruction is as follows:



{mac mac mac mac mac mac mac mac} word v3, v1, v2(77777777)

{mov pshra pshra pshra pshra pshra pshra pshra} word v1, v1, v0(0ZZZZZZZ)
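
The data movement of these two instructions can be followed with a short Python sketch. It is a simplified model, not the hardware behavior: slots are Python floats indexed by slot number (index 0 is slot 0), fixed-point rounding and saturation are ignored, and the function name `fir_step` is hypothetical. The routing characters are read as in the ORU mnemonic, so `77777777` presents slot 7 of v2 to every slot, and the leading `0` of `0ZZZZZZZ` presents slot 0 of v0 to slot 7.

```python
def fir_step(v0, v1, v2, v3):
    """Model one pass of the two-instruction inner loop; returns new (v1, v3)."""
    # {mac x8} word v3, v1, v2(77777777):
    # every destination slot accumulates v1[s] times slot 7 of v2.
    new_v3 = [v3[s] + v1[s] * v2[7] for s in range(8)]

    # {mov pshra x7} word v1, v1, v0(0ZZZZZZZ):
    # slots 6..0 take the value from the slot above (pshra reads the old v1),
    # and slot 7 takes the value routed from slot 0 of v0 (mov).
    new_v1 = [v1[s + 1] for s in range(7)] + [v0[0]]
    return new_v1, new_v3


v0 = [9.0] + [0.0] * 7                          # value routed into slot 7
v1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # values being shifted
v2 = [0.0] * 7 + [2.0]                          # slot 7 is broadcast by the router
v3 = [0.0] * 8                                  # accumulators
v1, v3 = fir_step(v0, v1, v2, v3)
```

After one step each accumulator in `v3` holds twice the old value of the matching `v1` slot, and `v1` has moved one slot toward slot 0 with the routed value entering at slot 7.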
The mac instruction is a multiply-and-accumulate instruction shown in Table I. The mac instruction places the cumulative sum of source A and source B slot products into respective destination slots, where they are then shifted one slot location to the right. The shift operation is carried out by a mov instruction and seven pshra instructions. The mov instruction places an immediate zero value in slot 7 of the destination register, while pshra shifts slot values to the adjacent right slot in preparation for the next mac instruction. Serial shifting of data brings elements serially forward in the FIR algorithm. The pshra instruction is also useful in IIR filters when getting new data values, or in any algorithm where only one or a few new data values are added at each step.

FIG. 13 illustrates yet another vector instruction useful in slot routing. More specifically, the instructions shown in FIG. 13 are useful for moving data between upper and lower half sub-slots of one or more slots. There are four exemplary instructions blbh, ahbh, albl and blal shown in FIG. 13 indicative of many slot routing instructions shown in Table I. There are sixteen possible permutations by which sub-slots within source B register, source A register or source A and source B registers are moved and placed within slots of a destination register. The operation blbh causes movement of certain operands in accordance with reference number 140. More specifically, blbh operation causes the lower half sub-slot within slot s of source B register to be placed in the upper half sub-slot of slot s within the destination register, while the upper half sub-slot of slot s within source B register is placed in a lower half sub-slot of slot s within the destination register. The term "bl" refers to the lower half sub-slot of source B register and since "bl" occurs first in the blbh series, routing is directed to the upper half sub-slot of the destination. If "bl" occurs last in, e.g., a bhbl series, then routing would be directed to the lower half sub-slot of the destination. The operation ahbh causes transfer of sub-slots to the destination register in accordance with reference numeral 142. Similar to other slot routing instructions, ahbh routes sub-slots from one slot of a source register to the same slot of the destination register. For example, slot routing occurs from sub-slots within slot 1 to sub-slots within slot 1, etc. Thus, ahbh causes slot s upper half data of source A to move to slot s upper half data of the destination simultaneous with movement of slot s upper half data of source B

mation algorithms. Pixel format conversions within, for example, an intracoded I frame can be performed using logical shifting and byte format moves. Byte moves are performed on byte-sized data within sub-slots of various slots. A coded example of three instruction cycles used in converting four pixels from 16-bit 5:5:5 RGB format to 32-bit 8:8:8:8 aRGB format is as follows:

;[...]
{lsl1 lsr4 lsl1 lsr4 lsl1 lsr4 lsl1 lsr4} word v0, v0, v0(33221100)
;move data into byte formats from word formats
{albl albl ahbh ahbh albl albl ahbh ahbh} word v0, v0, v0(54761032)
;[...] and b into order
;finish lining up B, put each component in the correct byte, and zero
;out alpha
{mov mov blbh blbh mov mov blbh blbh} word v0, v0, v0(Z645Z201)
;optional step: zero out low bits to eliminate noise; v1 slots = 1111100000
{and and and and and and and and} byte v0, v1, v0(76543210)

The shift range n of lsl and lsr is limited to plus or minus four bit positions to minimize the size of the shifting logic, which also minimizes opcode size. Shifts from five to eight bits can be done with two instructions; shifts of nine or ten bits require three instructions; and, shifts larger than ten can be constructed with help from ORU 124.

The vector operational instructions set forth in Table I of which a few are described in FIGS. 10-13, are representative of operations useful in numerous DSP algorithms. However, to realize the full benefit of those operations, it is necessary that the vector instructions also include unique load/store vector instructions. To ensure data is optimally arranged from memory to the source registers or from source registers to destination registers, loading to particular slots or sub-slots is crucial. The load operation, or conversely the store operation, must be particularly attuned to fixed point values. Operations upon fixed point values use saturating arithmetic. Arithmetic upon signed fixed point values is represented in two's complement form, with the most significant bit being the signed bit.

The distinction between fixed binary point (i.e., fixed decimal point at the leftmost position to the immediate right of the sign bit) and integer operation is meaningful for multiplication operations. The binary point position is irrel-

to slot s lower half data of destination register. Instruction 45 evant for addition, shifting and Boolean operations. The 

albl is shown to cause movement of sub-slot data in accor- binary point position shared by the product and either of the 

dance with reference numeral 144, while operation blal two factors of a multiplication operation is arbitrary. The 

causes movement of sub-slot data in accordance with ref- hardware's behavior determines the binary point position of 

erence numeral 146. Instruction albl serves to concatenate the remaining factor. FIG. 14 illustrates capability of the 

the lower half sub -slots of the source registers as opposed to so MEU in supporting either signed or unsigned data formats, 

concatenating the upper half sub-slots resulting from As shown, an unsigned 8-bit data value can be loaded from 

instruction ahbh. The instruction blal concatenates in reverse memory to a register partition (either a slot or sub-slot 

order from the result produced by instruction albl. depending upon the amount of unpacking required). If an 

ORU 124 serves to route slots; however, instruction such 8-bit unsigned value is loaded into a 10-bit sub -slot, move- 

as blbh, ahbh, etc. serve to change the order of sub-slots 55 ment of bits during that load operation is shown by reference 

within any of the routed slots. If the slots comprise 20-bit numeral 148. The memory from which the 8 -bit value is 

partitions, it is noted that not only can the order of the 20-bils loaded can be any storage device other than the vector 

be changed by the ORU with respect to other 20-bit slots, but registers. Generally speaking, memory is defined as semi- 

1 0-bit sub-slots within one or more of the 20-bit slots can conductor memory or random access memory. Store opera - 

also be reordered. As shown in FIG. 13, upper and lower 60 tion 150 serves to move data from the vector registers back 

halves of each slot within a destination register can be to memory. All data is assumed to be in little-endian format, 

loaded separately or in reverse order by upper and lower half Loading of a 16-bit value from memory to a 20-bit slot is 

sub -slots within any of the source registers. Sub -slot routing shown by reference numeral 152, and storing data values 

is performed within the bounds of the same slot used as the from a slot back to a memory is shown by reference numeral 

source and destination slot. 65 154. 

Routing of data across upper and lower half barriers of FIG. 14 illustrates load/store of either an 8-bit byte or a 

one or more slots proves beneficial in MPEG motion esti- 16-bit word from memory to sub-slots/slots and back to 
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memory. Distinct load/store instructions are defined for the 
two different partition widths. For 20-bit partitions, the 
MEU treats the memory word as a 16-bit signed value as 
shown by load operation 152. However, for 10-bit partitions, 
the MEU treats a memory byte as an 8-bit unsigned value as 
shown by load operation 148. The reason why an 8-bit byte 
is assumed to be unsigned and a 16-bit word is signed is to 
lessen the support necessary to take into account the large 
number of both signed and unsigned versions of both. In 
most cases involving image data, 8-bit values tend to be 
unsigned. For instances, most pixel values are 8-bit unsigned 
byte values. Conversely, 16 -bit values tend to be signed, 
such as when those values represent an audio signal. A single 
instruction may be added following a load or before a store 
to perform format conversion from the default 8-bit 
unsigned or 16-bit signed format to the desired format, if 
necessary. An example of code which can perform such a 
conversion to 8 -bit signed format is as follows: 



vldw v0, mem128

{lsl lsl lsl lsl lsl lsl lsl lsl} byte v0, v0, v0 (76543210)



The load instruction vldw places the actual sign bit just to 
the right of the binary point, and the vector logical shift 
instruction lsl moves the sign bit to the left of the binary 
point and pads the lowest bit with an immediate 0. For 
conversion to 16-bit unsigned format, the following code
can be used:



vldw v0, mem128

{lsr lsr lsr lsr lsr lsr lsr lsr} word v0, v0, v0 (76543210)

The load instruction vldw places the most significant bit in 
the sign location, and the vector logical shift instruction lsr 
shifts this bit back to the most significant bit right of the 
binary point and places a 0 into the sign bit. 

Load 148 is shown in FIG. 14 to load an 8-bit unsigned
value from memory across the data bus to bit locations 1-8
within the vector register partition. The sign bit and least
significant bit are set to 0. The default 8-bit value is
unsigned. As described above, if the signed/unsigned nature
of the data does not match that assumed by the load
instruction, then a separate logical shift operation can be
used to translate the data after it has been loaded. To load a 20-bit
partition, a 16-bit signed value is drawn from memory,
wherein the 16 bits are left justified and the four rightmost
(least significant) bits are padded with zeros. As described
above, if the 16-bit value is unsigned, then a 1-bit logical
right-shift is performed after the load.

Store operations perform the opposite data conversions 
from loads. Stores from a 20-bit partition place the parti- 
tion's left-most 16 bits into the memory word, ignoring the 
lowest four bits. Stores from a 10-bit partition first check the 
partition's sign bit (bit 9). If the sign bit is set, the MEU 
stores 0 to the memory byte thus clipping the negative value 
to 0. If the sign bit is not set, then the partition's bits 1-8 are 
directly placed in the memory byte. To store 8-bit signed 
data, a 1-bit logical right-shift must be performed prior to the 
store. To store 16-bit unsigned data, it is necessary to 
perform a left-shift and to clip negative values to 0 prior to 
the store. 
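
The default load and store conversions described above can be modeled in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the bit layout (sign bit at bit 9 of a 10-bit partition with the memory byte in bits 1-8, and a 16-bit word left-justified in a 20-bit partition) is taken from the description of FIG. 14.

```python
# Illustrative model of the FIG. 14 conversions; all names are hypothetical.

def load_byte(mem_byte):
    """Unsigned 8-bit memory byte -> 10-bit partition (bits 1-8; bits 9 and 0 are 0)."""
    return (mem_byte & 0xFF) << 1

def store_byte(partition):
    """10-bit partition -> unsigned memory byte, clipping negative values to 0."""
    if partition & 0x200:            # sign bit (bit 9) set: negative value
        return 0
    return (partition >> 1) & 0xFF   # bits 1-8 go to the memory byte

def load_word(mem_word):
    """Signed 16-bit memory word -> 20-bit partition, left-justified, low 4 bits 0."""
    return (mem_word & 0xFFFF) << 4

def store_word(partition):
    """20-bit partition -> 16-bit memory word, ignoring the lowest four bits."""
    return (partition >> 4) & 0xFFFF
```

A byte or word that is loaded and then stored without modification comes back unchanged, while a 10-bit partition whose sign bit is set is stored as 0.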

Load and store instructions can therefore move up to 
sixteen 8-bit bytes between memory and a register partition 
(sub-slot) or can move up to eight 16-bit words between
memory and a register partition (slot). For example, 8-byte
loads and stores can be used to convert between byte- 
precision data and word-precision data. 

There are numerous types of load and store instructions 
which can move 10-bit or 20-bit data between memory and 
the vector registers. Table III provides a listing of various 
load and store instructions as follows: 

TABLE III

Load and Store Instruction Descriptions

Instruction Type        Mnemonic Format     Description

16-Byte, 20-Bit Load    vldw vd, mem128     Load destination register vd with 16 bytes
                                            of signed 16-bit data at address mem128.

8-Byte, 20-Bit Load     vldw vdh, mem64     Load slots 4 through 7 of destination
                                            register vd with 8 bytes of signed 16-bit
                                            data at address mem64. Set slots 0 through
                                            3 of vd to zero.

16-Byte, 10-Bit Load    vldb vd, mem128     Load destination register vd with 16 bytes
                                            of unsigned 8-bit data at address mem128.
                                            Data is loaded using a 2:1 byte interleave
                                            pattern.

8-Byte, 10-Bit Load     vldb vdh, mem64     Load destination register vd with 8 bytes
                                            of unsigned 8-bit data at address mem64.
                                            The upper half of each slot receives the
                                            memory values; the lower half of each slot
                                            is set to zero.

16-Byte, 20-Bit Store   vstw mem128, vs     Store source register vs to 16 bytes of
                                            signed 16-bit data at address mem128.

8-Byte, 20-Bit Store    vstw mem64, vsh     Store slots 4 through 7 of source register
                                            vs to 8 bytes of signed 16-bit data at
                                            address mem64.

16-Byte, 10-Bit Store   vstb mem128, vs     Store source register vs to 16 bytes of
                                            unsigned 8-bit data at address mem128.
                                            Data is stored using a 2:1 byte interleave
                                            pattern.

8-Byte, 10-Bit Store    vstb mem64, vsh     Store source register vs to 8 bytes of
                                            unsigned 8-bit data at address mem64. The
                                            upper half of each slot is stored to
                                            memory; the lower half of each slot is
                                            ignored.


FIGS. 15-18 illustrate in further detail the load/store
instructions set forth in Table III. Movement of data in
accordance with the vldw vd, mem128 and vstw mem128, vs
instructions is shown in FIG. 15. 8-bit bytes 0 through
F (hex) can be loaded in various ways from memory 160 to
slots 0 through 7 of a vector register 128. Instruction vldw
vd, mem128 provides a 20-bit load such that a load from
memory at address a maps each slot s to the memory word
at address a+2s. Accordingly, 20-bit loads to slot s occur
from a consecutive pair of address locations 01, 23, 45, etc.
The vstw mem128, vs operation is shown in FIG. 15 similar
to vldw vd, mem128 but for opposite data movement, i.e.,
from vector registers 128 rather than from memory 160.
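
The slot-to-address mapping of the 16-byte, 20-bit load and store can be sketched as follows. This is a hedged model, not the hardware's implementation: the function names are hypothetical, memory is modeled as a bytearray, and words are assembled little-endian as the description assumes.

```python
# Hypothetical model of vldw vd, mem128 and vstw mem128, vs (FIG. 15).

def vldw(memory, a):
    """Slot s receives the little-endian 16-bit word at address a + 2*s."""
    slots = []
    for s in range(8):
        word = memory[a + 2 * s] | (memory[a + 2 * s + 1] << 8)
        slots.append(word << 4)          # left-justify 16 bits in a 20-bit slot
    return slots

def vstw(memory, a, slots):
    """Inverse mapping: the upper 16 bits of slot s go to address a + 2*s."""
    for s, value in enumerate(slots):
        word = (value >> 4) & 0xFFFF     # the lowest four bits are ignored
        memory[a + 2 * s] = word & 0xFF
        memory[a + 2 * s + 1] = word >> 8
```

With memory bytes 0 through F (hex) at addresses 0 through 15, slot 0 is built from the byte pair 01 and slot 7 from the pair EF, matching the consecutive address pairs named in the text.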
FIG. 16 illustrates instructions vldw vdh, mem64 and
vstw mem64, vsh. Load vldw vdh, mem64 is carried forth
from memory address a=0-7 to respective register slots 4-7,
while slots 0-3 are set to 0. Similar to instruction vldw vd,
mem128, instruction vldw vdh, mem64 loads the destination
register 128 with signed 16-bit data. However, instead of
loading 16 bytes of data at address mem128, vldw vdh,
mem64 loads 8 bytes of data at address mem64. FIG. 16 also
illustrates a store operation, vstw mem64, vsh, which causes
storage of operands in slots 4-7 to 8 bytes of signed 16-bit
data at address mem64.

FIG. 17 illustrates the vldb vd, mem128 and vstb mem128,
vs load/store instructions wherein 16 byte load and store
operations occur in a 2:1 byte interleave pattern. A 10-bit
load from memory address a maps the lower half of each
slot s (i.e., lower half sub-slot) to the memory byte at address
a+s; and it maps the upper half of each slot (i.e., upper half
sub-slot) to the memory byte at address a+s+8. As a result,
the MEU performs independent but identical operations on
two sets of data that reside in two adjacent 8 byte octets of
memory.

FIG. 18 illustrates in further detail vldb vdh, mem64 and
vstb mem64, vsh load/store operations. A vldb vdh, mem64
instruction causes upper half sub-slots of each slot to receive
memory values, and the lower half sub-slot of each slot
is set to 0. Conversely, a vstb mem64, vsh instruction stores
the upper half sub-slot of each slot to a memory
address, while the lower half sub-slot of each slot is ignored.

Load/store mappings shown in FIGS. 15 and 17 allow
ORU 124 to operate the same way regardless of the partition
size specified in the vector instruction. Thus, ORU 124 can
be implemented in a single set of 8-to-1 multiplexers even
though it handles two fundamentally different data types.
FIG. 18 illustrates that an 8 byte load operation moves only 
half of the bits to the vector register. The entire 160-bit 
vector register, however, is updated by padding the bits 
within the unused sub-slots with 0s. This feature greatly 
simplifies the implementation of register renaming for the 
MEU because partial register updates do not occur. 

The interleave mapping for 10-bit partitions is completely 
transparent to the programmer as long as only 10-bit loads/ 
stores and vector instructions are performed on a given set 
of data. Interleaved mapping of 20-bit partitions is also 
transparent to the programmer if only 20-bit operations are 
performed. However, if 10-bit and 20-bit operations are 
mixed, then care must be taken to understand the mapping 
so that the expected results are produced. The interleaving 
can be very useful, for example, if a 10-bit load from an 
octet-sized memory location automatically expands and 
interleaves the byte-wide memory data to the upper portion 
of 20-bit partitions. The 20-bit operation can be immediately 
performed on this data without the need for explicit format 
conversions. Subsequently, 10-bit stores to octets can automatically
perform the inverse 20-bit to 10-bit packing function.
Thus, the present store operation, namely vstb mem64,
vsh, performs packing of n+4 bits within a slot of a vector
register to n/2 bits within an address of the memory unit.
Given n=16, 20-bit-to-8-bit packing can occur as part of the
store operation. Additional operations, such as move or shift
operations, need not occur to perform a packing function.
Packing serves to store the most significant bits from a slot.
Unpacking is an operation by which n/2 bits from a memory
address are loaded into n+4 bit locations within a slot. If
n=16, then a load operation such as vldb vdh, mem64 causes
8 bits within a memory address to be loaded into a 20-bit
slot. Utilizing load and store functions in such a manner
thereby avoids having to implement separate unpack and
pack instructions, respectively, within the MEU instruction
set. Accordingly, the same result can be achieved but with
fewer instructions. For MPEG, 8-bit pixels are unpacked to
20-bit numbers for DCT or IDCT manipulations, then the
results are repacked to 8-bit pixels. The internals of the DCT
and IDCT operations require more than 8 bits of precision,
for which packing and unpacking are particularly advantageous.
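
The unpack-on-load and pack-on-store behavior can be illustrated with a short Python sketch. The names are hypothetical, and the sketch assumes that slot s corresponds to the memory byte at address a+s for the 8-byte forms, which the text implies but does not state outright.

```python
# Hypothetical model of vldb vdh, mem64 (unpack) and vstb mem64, vsh (pack).

def vldb_h(memory, a):
    """8 unsigned bytes -> upper sub-slots of 8 slots; lower sub-slots are 0."""
    # byte -> 10-bit sub-slot (bits 1-8), then into the upper half of the slot
    return [((memory[a + s] & 0xFF) << 1) << 10 for s in range(8)]

def vstb_h(memory, a, slots):
    """Store the upper sub-slot of each slot as a byte, clipping negatives to 0."""
    for s, slot in enumerate(slots):
        upper = (slot >> 10) & 0x3FF
        memory[a + s] = 0 if upper & 0x200 else (upper >> 1) & 0xFF

def as_fraction(slot):
    """Interpret a 20-bit slot as a signed 1.19 fixed-point value."""
    if slot & 0x80000:
        slot -= 1 << 20
    return slot / (1 << 19)
```

Under these assumptions an 8-bit pixel value b lands in the slot as the fraction b/256, so 20-bit arithmetic can be applied directly and the result stored back as a byte without separate pack or unpack instructions.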

FIGS. 15 and 16 illustrate 20-bit load and store
operations, whereas FIGS. 17 and 18 illustrate 10-bit load
and store operations. For 10-bit load/store mappings, there is
purposely no interaction between data in the upper half
sub-slots and data in the lower half sub-slots. The ORU
routes data only in terms of slots and does not have the
resolution to route sub-slots. Given the ability to interleave
mapping, there can be no interaction between the data in the
octet starting at address a and the data in the adjacent octet
starting at address a+8, even though both data values can be
loaded at the same time. Thus, for 10-bit operations, there is
a barrier between memory octets that data does not cross.
This barrier is mapped by the interleave loads and stores to
the midpoint of each slot within each vector register.

For 20-bit operations, there is no barrier limitation since
each monolithic 20-bit ALU partition (i.e., logic portions
117) covers both the upper and lower sub-slots of each slot.
Whenever it is necessary to route 10-bit data across the
barrier separating sub-slots, 20-bit slot routing operations
blbh, ahbh, etc., are used. The following exemplary code
demonstrates how 20-bit operations serve to route slots
amongst each other, and sub-slots within certain slots:



;16 video bytes are in data in memory (the MSB, A, is shown on left):
;ABCD EFGH IJKL MNOP

;need to extract 8 unaligned bytes from center: FGHI JKLM
;load 16 bytes into register v0 (load does interleaving)
vldb v0, byte ptr [esi] ;esi points to byte "P"

;now v0 contains AIBJ CKDL EMFN GOHP
;in slots:       7766 5544 3322 1100

;use 20-bit routing ops to move data across 10-bit routing barrier
{mov mov mov blbh blbh blbh blbh blbh} word v0, v0, v0 (21076543)
;now v0 contains FNGO HPIA JBKC LDME = FxGx HxIx JxKx LxMx
;store 8 bytes into memory
vstb byte ptr [edi], v0h
;[edi] contains FGHI JKLM



Movement of data not only between slots, but between
sub-slots is particularly helpful when performing MPEG
motion compensation on 8-bit pixel values. In the example
shown above, a single load instruction which causes interleaving
of 16 bytes, followed by a single vector instruction
combining three move and five sub-slot routing operations,
performs the same function but in a more efficient manner
than doing unaligned memory references.
Thus, MPEG motion compensation on a 1x8 block is
advantageously performed by a single interleaving load
operation, followed by a single vector instruction containing
three move operations (mov) and five sub-slot swapping
operations (blbh) across five slot midpoints.
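
The exemplary listing can be traced with a small Python simulation. This is a sketch under stated assumptions rather than a model of the actual hardware: the helper names are hypothetical, each slot is represented as an (upper, lower) pair of letters instead of bit fields, and the routing digits are read from slot 7 down to slot 0 as in the listing.

```python
# Hypothetical trace of the interleaved load, slot routing and 8-byte store.

def vldb(memory, a):
    """vldb vd, mem128: slot s = (upper, lower) = (byte a+s+8, byte a+s)."""
    return [(memory[a + s + 8], memory[a + s]) for s in range(8)]

def mov(slot):
    return slot

def blbh(slot):
    upper, lower = slot
    return (lower, upper)   # source lower half -> destination upper half

def vector_op(ops, src, routing):
    """ops and routing are written from slot 7 down to slot 0."""
    dest = [None] * 8
    for i, (op, r) in enumerate(zip(ops, routing)):
        dest[7 - i] = op(src[int(r)])   # ORU routes slot r, then the op is applied
    return dest

def vstb_h(slots):
    """vstb mem64, vsh: byte a+s receives the upper half of slot s."""
    return [slots[s][0] for s in range(8)]

letters = "ABCDEFGHIJKLMNOP"        # MSB (A) shown on the left
memory = list(reversed(letters))    # address 0 holds P, address 15 holds A
v0 = vldb(memory, 0)
v0 = vector_op([mov, mov, mov, blbh, blbh, blbh, blbh, blbh], v0, "21076543")
stored = vstb_h(v0)
result = "".join(reversed(stored))  # show the MSB on the left again
```

Tracing the steps by hand gives the same intermediate register contents as the comments in the listing, and `result` holds the eight center bytes FGHI JKLM.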

All MEU instructions, whether those instructions are 
load/store instructions or vector operational instructions, are 
mapped into a single row in the 0Fh (hex) prefix section of
the x86 opcode map. The MEU load and store instructions 
are used in normal modR/M-based instruction format, 
wherein 8 opcodes are used (one for each load and store 
variation). The reg field of the modR/M byte selects the 
vector register. The opcode may optionally be followed by 
a SIB byte and/or a displacement value. The following Table 
IV illustrates MEU instruction opcode map: 

TABLE IV

Instruction Opcode Map

Instruction                 Encoding

vldw vd, mem128             0Fh F0h modR/M [SIB] [disp]
vldw vdh, mem64             0Fh F1h modR/M [SIB] [disp]
vldb vd, mem128             0Fh F2h modR/M [SIB] [disp]
vldb vdh, mem64             0Fh F3h modR/M [SIB] [disp]
vstw mem128, vs             0Fh F4h modR/M [SIB] [disp]
vstw mem64, vsh             0Fh F5h modR/M [SIB] [disp]
vstb mem128, vs             0Fh F6h modR/M [SIB] [disp]
vstb mem64, vsh             0Fh F7h modR/M [SIB] [disp]
(All Vector Instructions)   0Fh F8h nn nn nn nn nn nn nn nn



All MEU register-to-register vector instructions outlined 
in Table I share a single additional x86 opcode. The vector 
instructions do not use a modR/M memory reference. 
Instead, 8 bytes are added to the vector instructions to hold 
the vector instruction information and bits for future
expansion.

The addressing mode (modR/M) byte specifies the regis- 
ters used by the instruction, as well as memory addressing 
modes. More particularly, the modR/M byte may specify a 
register value to be added to the displacement in order to 
form a memory address for the load/store instructions. 
Alternatively, the modR/M byte may specify that the SIB 
byte is included. The scale-index-base (SIB) byte is used 
only in 32-bit base-relative addressing using scale and index
factors. A base field of the SIB byte specifies which register
contains the base value for the address calculation, and an
index field specifies which register contains the index value.
A scale field specifies the power of two by which the index 
value will be multiplied before being added, along with any 
displacement, to the base value, thereby forming a memory 
address. The optional displacement field (disp) may be from 
one to four bytes in length. The displacement field contains 
a constant which is added to one or more register values to 
form the address for the load/store instructions. 
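
The address formation described above can be written out as a short Python sketch. The function names are hypothetical; the modR/M bit layout (two-bit mod field, three-bit reg field, three-bit r/m field) is the standard x86 layout, and the example encoding simply combines it with the opcode bytes of Table IV for an assumed load of vector register 1 through [eax].

```python
# Hypothetical helpers for the load/store address calculation and encoding.

def effective_address(base=0, index=0, scale=0, disp=0):
    """base + index * 2**scale + disp, wrapped to 32 bits."""
    return (base + (index << scale) + disp) & 0xFFFFFFFF

def modrm(mod, reg, rm):
    """Pack the x86 modR/M byte; reg selects the vector register here."""
    return ((mod & 0x3) << 6) | ((reg & 0x7) << 3) | (rm & 0x7)

# Assumed example: vldw v1, [eax] -> 0Fh F0h followed by modR/M (mod=00, rm=000)
example = bytes([0x0F, 0xF0, modrm(0b00, 1, 0b000)])
```

For instance, a base of 1000h, an index of 3 scaled by four (scale field 2) and a displacement of 8 form the address 1014h.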

FIG. 19 provides exemplary arithmetic operations per- 
formed on fixed point values according to several of the 
vector operational instructions shown in Table I. FIG. 19 
illustrates fractional representations of fixed point numbers
which range in some instances between -1.0 and +1.0. One
advantage of using a fractional, fixed-point format over an
integer format or a floating point format is that the magnitude
of the data does not grow with each multiply operation.
Namely, the product of two numbers within the +1.0 to -1.0
range, or between 0.0 and approximately 1.0, is another
number within that range. Thus, even though the inputs and
outputs of an algorithm may need to be scaled, it is less 
likely that the data will need to be re-scaled at each step. 

In FIG. 19, a 10-bit source A operand is represented as 
1.101011100 (binary), which corresponds to -0.3203 
(decimal). The most significant bit to the right of the sign bit 
is represented as 0.5 decimal, the next most significant bit is 
0.25 decimal, the next most significant bit is 0.125, and so 
on. Since the 10-bit source A operand is a negative value,
two's complement arithmetic is used, whereby the decimal
value of the fraction bits (0.6797) is added to a -1.0
(decimal) value to render the -0.3203 value.
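
The conversion of the source A operand can be checked with a few lines of Python. The helper name is hypothetical; the sketch only assumes the format described above, with the sign bit to the left of the binary point and nine fraction bits to its right.

```python
# Hypothetical helper: interpret a bit string as a two's complement fraction.

def fixed_to_float(bits, width=10):
    """'1101011100' -> value of 1.101011100 in 1.(width-1) fixed-point format."""
    raw = int(bits, 2)
    if raw & (1 << (width - 1)):     # sign bit set: subtract 2**width
        raw -= 1 << width
    return raw / (1 << (width - 1))  # scale by the number of fraction bits

source_a = fixed_to_float("1101011100")
```

The fraction bits .101011100 sum to 0.6796875, and adding -1.0 for the sign bit gives -0.3203125, which is the -0.3203 quoted in the text.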

An important benefit of having extended precision gained 
by adding 25 percent more bits to the value as it is loaded 
into registers is the capability of creating a unified repre- 
sentation for signed and unsigned memory data. This rep- 
resentation retains all the information present from either 
format and eliminates the need to have different ALU 
opcodes for signed and unsigned data. Load operations 148 
and 152 in FIG. 14 illustrate the added precision and the 
capability of representing either signed or unsigned data in 
a single, unified format. Thus, regardless of whether the data
in memory is signed or unsigned, data within the registers
takes on a single signed format, so the vector ALU
can operate on it with a single unified instruction regardless of
whether the stored value was signed or unsigned.

The following Table V illustrates the extended precision
offered by loading an 8-bit byte into a 10-bit sub-slot, or
loading a 16-bit word into a 20-bit slot:

TABLE V

MEU Data Format Value Ranges

                           Bit    Binary Point  Minimum Repre-   Maximum Repre-
Data Type                  Width  Posn.         sentable Value   sentable Value

Unsigned Mem Byte          8      0.8           0.0              0.9961 (1 - 2^-8)
Signed Mem Byte            8      1.7           -1.0             0.9922 (1 - 2^-7)
Byte Register Partition    10     1.9           -1.0             0.9980 (1 - 2^-9)
Unsigned Mem Word          16     0.16          0.0              0.9999847 (1 - 2^-16)
Signed Mem Word            16     1.15          -1.0             0.9999695 (1 - 2^-15)
Word Register Partition    20     1.19          -1.0             0.9999981 (1 - 2^-19)



Table V illustrates maximum and minimum values of data
within memory or within the vector registers, depending
upon whether the memory data is signed or unsigned. ALU
116 performs all arithmetic operations using saturating arithmetic.
In contrast to modulo arithmetic, saturating arithmetic
forces a value to be "clipped" if it is too large to fit in the
destination. Modulo arithmetic merely wraps the large value
back around, leaving a remainder value. The clipping mechanism
of saturating arithmetic is one whereby a maximum
representable positive value is substituted for the oversized
positive value. A similar substitution is done when the result
is too negative. If the data is signed data, and the sign bit
is set such that a negative value is represented, then if the
negative value becomes too large to fit in the destination bit
locations, a maximum representable negative value is substituted
for the oversized negative value. Table V illustrates
the maximum and minimum positive and negative values
which would be substituted if an overflow occurs. Saturating
arithmetic is more suitable than modulo arithmetic for
performing operations upon image data or audio data.

Vector add, subtract and Boolean instructions are performed
on 10-bit or 20-bit quantities. If the result of an add
or subtract operation goes outside the range offered by a
10-bit or 20-bit partition, then the result is clipped to the
largest positive or negative representable value. Boolean
operations, however, are not clipped. The result of add,
subtract and move vector instructions may optionally be
shifted right by one bit before storing to the destination. The
right-shift, or scaling operation, can be used to compensate
for the tendency of the data magnitude to grow with each
add and subtract operation. The add and subtract operations
generate at most one bit of overflow; the scaled versions of
add and subtract cause a shift of this overflow bit into the
high bit of the result so that clipping can be avoided.
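
The difference between clipped, scaled and wrapped addition can be illustrated in Python. This is a hedged sketch: the function names are hypothetical, partitions are modeled as signed Python integers (a 10-bit partition spans -512 to 511), and the modulo version is included only for contrast since the MEU is described as using saturating arithmetic.

```python
# Hypothetical model of saturating and right-scaled addition on one partition.

def saturate(value, width):
    """Clip to the largest positive or negative value representable in width bits."""
    hi = (1 << (width - 1)) - 1
    lo = -(1 << (width - 1))
    return max(lo, min(hi, value))

def add_sat(a, b, width=10):
    return saturate(a + b, width)

def add_scaled(a, b, width=10):
    """Add, then shift right by one bit; the dropped low bit is truncated."""
    return saturate((a + b) >> 1, width)

def add_modulo(a, b, width=10):
    """Wrap-around addition, shown only for comparison."""
    r = (a + b) & ((1 << width) - 1)
    return r - (1 << width) if r & (1 << (width - 1)) else r
```

Adding 400 and 300 in a 10-bit partition clips to 511 with `add_sat`, yields 350 with `add_scaled` (no clipping needed), and wraps to -324 with `add_modulo`, which is why wrap-around is poorly suited to pixel or audio samples.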

Multiply instructions take two 10-bit or 20-bit signed
operands, and generate a 19-bit or 39-bit signed product. The
least significant 9 or 19 bits of the product are rounded and
dropped before storing into the 10-bit or 20-bit destination
register. An example of a multiply operation performed on
two 10-bit signed operands is shown in FIG. 19. The
resulting 19-bit intermediate product is rounded, and the
least significant bits dropped to produce a 10-bit intermediate
product. The 10-bit operands which are multiplied
together are shown by reference numerals 170 and 172. The
19-bit intermediate product is shown as reference numeral
174, and the intermediate product after the least significant
bits are rounded and dropped is shown as reference numeral
176.

Simple multiply operations do not require clipping since
an overflow condition generally cannot occur. However, a
multiply-and-accumulate (mac) vector instruction does
require clipping of the operand product. The mac instruction
is carried forth by adding the previous operand
product to the current operand product and storing that
summation as a final result. The previous product occupies
a 10-bit location within a destination register, as shown by
reference numeral 178. When the previous product 178 is
added to the current product 176, a final result 180 is
produced. Result 180 therefore represents a running sum of
the multiply products. The running sum is shown as a
clipped value, since summation of operands 176 and 178
causes a negative value too large to fit within the 10-bit
sub-slot.
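
A minimal Python sketch of the multiply and multiply-and-accumulate behavior follows. It rests on several assumptions that are not spelled out in the text: operands are modeled as signed integers scaled by 2^9 (for 10-bit partitions), rounding adds half of the destination least significant bit before truncating, and the accumulation uses the already rounded 10-bit product. The function names are hypothetical.

```python
# Hypothetical model of mul and mac on 10-bit (1.9 format) partitions.

def saturate(value, width):
    hi = (1 << (width - 1)) - 1
    lo = -(1 << (width - 1))
    return max(lo, min(hi, value))

def mul_fixed(a, b, width=10):
    frac = width - 1
    product = a * b                                   # full-precision product
    rounded = (product + (1 << (frac - 1))) >> frac   # add 1/2 LSB, drop 9 bits
    return saturate(rounded, width)

def mac_fixed(acc, a, b, width=10):
    """Add the rounded product to the previous result and clip the sum."""
    return saturate(acc + mul_fixed(a, b, width), width)
```

For example, 0.5 times 0.5 (256 times 256 in raw form) gives 128, which is 0.25, and accumulating a product near -1.0 onto a previous result of -400 clips the running sum to -512.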

Rounding issues arise whenever an operation produces
low-order bits that do not map into the destination format.
Rounding occurs in the following vector instructions: round
(rnd n), multiply (mul and mac), right-scaled additive operations
(add_, sub_, sbr_, and acum_), right-shift operations
(asr n, lsr n), and store operations (vstb and vstw). When a
round (rnd n) operation occurs, data is not shifted; instead,
the low-order bits are set to 0. A "round to even" method is
used when the rounded bits are exactly equal to one half of
the designated least significant bit (bit n). In this case, the
rounding direction is picked so that the result (from bit n up)
is even. This convergent rounding eliminates any statistical
bias on the direction of the rounding. In a multiply operation
(mul and mac), the 20-bit partition versions of these operations
drop the lowest 19 bits of the 39-bit intermediate
product. In a 10-bit partition version, 9 bits of the 19-bit
intermediate product are dropped. These operations implement
simple rounding by adding a value of 1/2 of the
destination operand least significant bit to the intermediate
product before truncating it. To keep the multiplier data path
as short as possible, rounding is not convergent. If the bits
to be dropped are exactly equal to 1/2 of the destination
operand least significant bit, then the result is unconditionally
rounded up. In right-scaled additive operations, right-shift
operations and store operations, no rounding is performed.
Instead, the lowest bit(s) are truncated. Generally
speaking, regardless of the operation, if a rounded result is
important for the operation that performs a truncation, then
an explicit rnd n can be applied to the data prior to the
operation.
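
The convergent rounding of the rnd n operation can be sketched in Python as follows. The function name is hypothetical, the value is modeled as a signed Python integer, and any clipping that might follow a round-up at the top of the range is left out of this sketch.

```python
# Hypothetical model of rnd n: clear the bits below bit n with round-to-even.

def rnd(x, n):
    """Round x to a multiple of 2**n; exact ties go to the even result."""
    if n == 0:
        return x
    unit = 1 << n
    half = unit >> 1
    low = x & (unit - 1)     # the low-order bits being rounded away
    base = x - low           # x with those bits set to 0 (data is not shifted)
    if low > half or (low == half and (base >> n) & 1):
        base += unit         # round up, or break a tie toward the even value
    return base
```

With n = 2, the value 6 (1.5 units) rounds to 8 and the value 2 (0.5 units) rounds to 0, so ties alternate direction and no statistical bias accumulates.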

While fixed point arithmetic is used, there may be
instances where block floating point operations would be of
benefit. The magnitude check (mag) vector instruction is
used to implement block floating point operations. If results
from fixed point math become too small or large to fit in a
destination register and clipping is not desired, then scaling
the data to a block floating point value can occur. The mag
instruction automatically checks for runs of up to seven 1s
or 0s. The mag instruction therefore checks all data following
a computation, and scale instructions (asl or asr) scale all
data according to the shortest run of 1s or 0s. If the shortest
run is seven or more bits, this still leaves sufficient dynamic
range. Consequently, the mag instruction does not check
beyond seven bits. This limitation significantly reduces the
gate count (i.e., silicon area) necessary to implement this
instruction.
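
One way to picture block floating point scaling is the Python sketch below. It is an interpretation, not the documented behavior of the mag instruction: it assumes that the "run" being measured is the number of bits immediately after the sign bit that repeat the sign bit, and that the whole block is then shifted left by the shortest such run. All names are hypothetical.

```python
# Hypothetical interpretation of a magnitude check followed by block scaling.

def headroom(value, width=10, limit=7):
    """Count bits after the sign bit that repeat the sign bit, up to `limit`."""
    sign = (value >> (width - 1)) & 1
    run = 0
    for bit in range(width - 2, -1, -1):
        if ((value >> bit) & 1) != sign or run == limit:
            break
        run += 1
    return run

def normalize_block(values, width=10):
    """Scale every element by the shortest run found in the block."""
    shift = min(headroom(v, width) for v in values)
    return [v << shift for v in values], shift
```

Under this interpretation the block [23, -3, 100] has runs of 4, 7 and 2 bits, so every element is shifted left by two and the shared exponent is adjusted by the same amount.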

A distance instruction dist is also provided with the vector
operational instructions listed in Table I. The dist instruction
is useful for MPEG motion estimation. Motion estimation
requires finding the difference between pixels in different
frames. Pixel comparisons are done on 16x16 pixel blocks,
called macroblocks. This operation requires finding the
difference between two pixel values (the error) and summing
the errors.
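
The error summation used in motion estimation can be sketched in Python. The text only says that pixel differences are found and summed; taking the absolute value of each difference (a sum of absolute differences) is an assumption of this sketch, and the function name is hypothetical.

```python
# Hypothetical macroblock error: sum of absolute pixel differences.

def block_error(block_a, block_b):
    """Sum |p - q| over corresponding pixels of two equally sized blocks."""
    return sum(
        abs(p - q)
        for row_a, row_b in zip(block_a, block_b)
        for p, q in zip(row_a, row_b)
    )

current = [[10] * 16 for _ in range(16)]    # 16x16 macroblock of pixel values
candidate = [[12] * 16 for _ in range(16)]
error = block_error(current, candidate)
```

A motion search would evaluate this error for several candidate positions and keep the one with the smallest sum.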

There are no limitations on using the vector instructions
of Tables I and II concurrent with integer instructions.
Further, there are no limitations on mixing the vector
instructions with floating point instructions (i.e., x87-type 
instructions). However, frequent switching between vector 
instructions and floating point instructions may cause the 
microprocessor to stall execution while it performs opera- 
tions to maintain coherency between the MEU and floating 
point units. Thus, while a portion of the floating point 
registers may be dedicated to vector registers useable with 
an MEU, coherency between those registers and non-MEU 
floating point registers may be needed. The vector registers 
are designated and correspond to the physical floating point 
unit registers. Thus, the floating point unit physical register 
0 is the same as the lower half of MEU vector register V0, 
and the floating point unit physical register 1 is the same as 
the upper half of MEU vector register V0. This mapping of 
vector registers to floating point unit registers continues such 
that the floating point unit physical register 7 is the same as 
the upper half of MEU vector register V3. 

An x86 processor has two bits in the CR0 register to help 
manage task switching and emulation for floating point 
code. The two bits are designated the TS bit and the EM bit. 
The TS bit is set whenever a task switch occurs. While the 
TS bit is set, interrupt seven is called when any floating point 
unit instruction is encountered. The operating system han- 
dler for interrupt seven saves the floating point unit state and 
resets the TS bit. This scheme allows the operating system 
to save the floating point unit state only for tasks that 
actually use the floating point unit. The MEU uses the TS bit 
in the same way as the floating point unit. Any MEU 
instruction that is encountered while the TS bit is set also 
causes assertion of interrupt seven. The EM bit is intended 
to help implement software emulation of the floating point 
unit. When the EM bit is set under software control, any 
floating point unit instruction causes an interrupt seven. 
However, execution of MEU instructions does not cause an 
interrupt seven to occur since, if the MEU exists, there is no 
need to emulate its instructions. 
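The TS and EM behavior described above reduces to a short decision rule. The Python sketch below is only a model of that rule; the function name `raises_interrupt_seven` and the string labels "fpu" and "meu" are hypothetical.

```python
def raises_interrupt_seven(kind, ts, em):
    """Return True if an instruction of the given kind causes
    interrupt seven for the given CR0 TS and EM bit values.

    kind is "fpu" for a floating point unit instruction or
    "meu" for an MEU vector instruction.
    """
    if kind not in ("fpu", "meu"):
        raise ValueError("kind must be 'fpu' or 'meu'")
    if ts:
        # After a task switch, both instruction types trap so the
        # operating system can save the shared register state.
        return True
    if em and kind == "fpu":
        # Emulation applies only to floating point unit instructions.
        return True
    return False


print(raises_interrupt_seven("meu", ts=False, em=True))  # prints False
print(raises_interrupt_seven("fpu", ts=False, em=True))  # prints True
```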

Conventional floating point units comprise three registers 
for status and control: the floating point unit status word, 
control word and tag word. These registers contain bits for 
exception flags, exception mask, condition codes, precision 
codes, rounding control and stack tags. The MEU does not 
use or modify any of the above bits except for the stack tag 
bits. The MEU modifies the stack tag bits because MEU 
result values are often not valid floating-point numbers. Any 
time an MEU vector instruction is executed, the entire 
floating point unit tag word is set to 0FFFFh, marking all 
floating point unit registers as empty. In addition, the top- 
of-stack pointer in the floating point unit status word (bits 
11-13) is set to 0, indicating an empty stack. Thus, any 
vector instruction effectively destroys any floating-point 
values that may have been in the floating point unit. This is 
not of concern since between task switches the OS 
(operating system) saves and restores the complete floating- 
point unit stack for each task. Use of both MEU instructions 
and floating point unit instructions within the same task is 
generally undesirable, and may require saving of the state of 
the floating point unit/MEU registers between the execution 
of any two instructions of differing types. 
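The effect of a vector instruction on the floating point unit status can be modeled in a few lines. This Python sketch treats the status word and tag word as 16-bit integers; the names `TOP_MASK`, `TAG_ALL_EMPTY` and `after_vector_instruction` are hypothetical and the sketch is not a description of the actual hardware logic.

```python
TAG_ALL_EMPTY = 0xFFFF    # tag word value marking every register empty
TOP_MASK = 0b111 << 11    # top-of-stack pointer, status word bits 11-13


def after_vector_instruction(status_word, tag_word):
    """Return (status_word, tag_word) as left by any MEU vector
    instruction: top-of-stack cleared, all registers tagged empty.
    All other status word bits are passed through unchanged.
    """
    new_status = status_word & ~TOP_MASK & 0xFFFF
    return new_status, TAG_ALL_EMPTY


status, tags = after_vector_instruction(0x3841, 0x0000)
print(hex(status), hex(tags))  # prints 0x41 0xffff
```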

Merely to help understand the various vector instructions 
and a practical purpose of such instructions, code which 
implements the IDCT flowgraph of FIG. 4 is as follows: 



vlds v0l, [esi] 
vlds v0h, [esi+8] 
{mul mul mul mul mul mul mul mul} word v0, v2, v0 (37156240) 
{add subr add sub add sub subr add} word v0, v0, v0 (45672301) 
{mul mul mul mul mul mul mul mul} word v1, v3, v1 (56547264) 
{acum add add add acum add acum acum} word v0, v1, v1 (70312ZZZ) 
{add, add, sub, subr, subr, subr, add, add} word v0, v0, v0 (654Z0123) 
{subr, subr, subr, subr, add, add, add, add} word v0, v0, v0 (01234567) 
vsts [esi], v0l 
vsts [esi+8], v0h 



The esi register points to the data, and vector registers v2 
and v3 are pre-loaded with the constant coefficients of the 
IDCT algorithm. The above code illustrates many of the 
vector instructions (operational vector instructions and load/ 
store vector instructions) as they pertain to MPEG, and more 
specifically the IDCT algorithm in MPEG decoders. 
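The eight-character routing field that follows each operational instruction in the code above can be illustrated with a hedged Python sketch. It assumes that the field is read from slot 7 (leftmost character) down to slot 0 (rightmost character), that each digit names the source slot routed to that position, and that Z supplies a zero operand. These conventions and the function name `route` are assumptions made for illustration, not a definition taken from this section.

```python
def route(src, pattern):
    """Rearrange the eight slots of src according to pattern.

    src[i] holds slot i. pattern is an 8-character string; its
    leftmost character selects the source for slot 7 and its
    rightmost character selects the source for slot 0.
    Assumption: 'Z' routes a zero instead of a source slot.
    """
    if len(src) != 8 or len(pattern) != 8:
        raise ValueError("expected eight slots and eight routing characters")
    out = [0] * 8
    for pos, ch in enumerate(pattern):
        slot = 7 - pos
        out[slot] = 0 if ch in "Zz" else src[int(ch)]
    return out


slots = [10, 11, 12, 13, 14, 15, 16, 17]   # slot 0 .. slot 7
print(route(slots, "76543210"))  # identity routing
print(route(slots, "77665544"))  # upper four slots, each duplicated
print(route(slots, "ZZZZZZZ1"))  # slot 1 routed to slot 0, zeros elsewhere
```

Under these assumptions, a single routing field can reorder, duplicate or zero operands before they reach the vector ALU, which is why one instruction can serve many flowgraph stages.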

As another example, code can be written to implement a 
stretch BitBlt algorithm. The ORU 124 proves particularly 
beneficial in the BitBlt algorithm, as evidenced by the 
following code: 

vldw v0, [esi] ; get source pixels (16bpp) 
{mov mov mov mov mov mov mov mov} word v1, v0, v0 (77665544) 
vstw [ebp], v1 ; store stretched pixels 
vstw [ebp+scanline], v1 ; store stretched pixels 
{mov mov mov mov mov mov mov mov} word v1, v0, v0 (33221100) 
vstw [ebp+16], v1 ; store stretched pixels 
vstw [ebp+16+scanline], v1 ; store stretched pixels 

The esi registers point to the source, and the ebp registers 
point to the destination. In the BitBlt example, the source 
pixels are copied to 4x of the original size, wherein pixels 
are assumed to be 16 bits per pixel. 

As another illustrative example, code can be written to 
perform the inner loop of MPEG motion estimation as 
follows: 

vldb v0, [esi] 
vldb v1, [edi] 
{dist dist dist dist dist dist dist dist} byte v1, v1, v0 (76543210) 
{acum acum acum acum acum acum acum acum} byte v2, v1, v1 (76543210) 

The esi register points to the reference pixels (or pixels 
within the I frames) and the edi registers point to the search 
pixels. Vector registers v0 and v1 point to pixels to be 
compared, and vector register v2 contains the sum of the 
errors. After the errors have been summed, the partitions 
need to be summed together, as shown by the following 
code: 

{add_ add_ add_ add_ add_ add_ add_ add_} byte v2, v2, v2 (Z7Z5Z3Z1) 
{add_ add_ add_ add_ add_ add_ add_ add_} byte v2, v2, v2 (ZZZ6ZZZ2) 
{add_ add_ add_ add_ add_ add_ add_ add_} byte v2, v2, v2 (Z7ZZZZZ4) 
{acum acum acum acum acum acum blbh acum} word v2, v2, v2 (ZZZZZZ0Z) 
{add_ add_ add_ add_ add_ add_ add_ add_} byte v2, v2, v2 (ZZZZZZZ1) 

The blbh instructions are used to swap the partitions to 
generate the final add. 

It will be appreciated to those skilled in the art having the 
benefit of this disclosure that this invention is believed to be 
capable of performing various multimedia-type algorithms. 
Operations within the algorithms are performed in stages, 
wherein multiple operations in each stage are carried out in 
concurrent fashion and with minimal impact upon the core 
blocks of a conventional x86 microprocessor. Thus, it is to 
be understood that the form of the invention shown and 
described is to be taken as presently preferred embodiments 
of an MEU having partitioned registers, possibly derived 
from a floating point unit, partitioned ALU and an ORU 
interposed therebetween. Various modifications and changes 
may be made to the processor core, as well as to each and 
every component of the MEU, as would be obvious to a 
person skilled in the art having the benefit of this disclosure. 
It is intended that the following claims be interpreted to 
embrace all such modifications and changes and, 
accordingly, the specification and drawings are to be 
regarded in an illustrative rather than a restrictive sense. 

What is claimed is: 

1. A computer, comprising: 

an input/output device operably coupled to a 
microprocessor, wherein the microprocessor includes: 

an instruction cache configured to store coded first and 
second sets of instructions obtained from the 
input/output device, wherein said first set of instructions 
comprises integer instructions for operating on integer 
operands and said second set of instructions 
comprises vector instructions for operating on vector 
data; 

a decode unit configured to decode said vector 
instructions; and 

wherein said microprocessor is configured to perform a 
load of data having a first bit size from a first 
memory location having said first bit size to a 
register slot having a second bit size, wherein said 
first bit size is smaller than said second bit size, and 
wherein said microprocessor is configured to perform 
an unpacking operation during the load to fill 
said register slot, wherein said microprocessor is 
configured to load said data and perform said 
unpacking operation in response to said decode unit 
decoding a single vector load instruction. 

2. The computer as recited in claim 1 wherein said second 
bit size comprises twice the number of bits of said first bit 
size plus four bits. 

3. The computer as recited in claim 2, wherein said 
microprocessor comprises a register partitioned into slots 
each having said second bit size, and wherein said 
microprocessor is configured to load the most significant bit 
locations of each slot with data from memory locations 
having said first bit size in response to said decode unit 
decoding said single vector load instruction. 

4. The computer as recited in claim 3, wherein each one 
of said slots is partitioned into an upper half sub-slot and a 
lower half sub-slot, and wherein said microprocessor is 
configured so that each said upper half sub-slot receives bits 
from a memory location having said first bit size and the 
lower half sub-slot receives a plurality of zero bits. 

5. The computer as recited in claim 1, wherein the upper 
half sub-slot receives more significant bits than the lower 
half sub-slot. 

6. The computer as recited in claim 1, wherein said 
unpacking operation occurs within the same instruction 
cycle as the vector load instruction. 

7. The computer as recited in claim 1, wherein said vector 
instructions comprise instructions for loading fractional 
signed or unsigned digital values into said register slot. 

8. A computer, comprising: 

an input/output device operably coupled to a 
microprocessor, wherein the microprocessor comprises: 

an instruction cache configured to store coded first and 
second sets of instructions obtained from the 
input/output device, wherein said first set of instructions 
comprises integer instructions for operating on integer 
operands and said second set of instructions 
comprises vector instructions for operating on vector 
data; 

a decode unit configured to decode said vector 
instructions; and 

wherein said microprocessor is configured to perform a 
store of data having a second bit size from a register 
slot having said second bit size to a first memory 
location having a first bit size, wherein said first bit 
size is smaller than said second bit size, and wherein 
said microprocessor is configured to perform a packing 
operation during the store on said data to fit said 
data into said first memory location, wherein said 
microprocessor is configured to store said data and 
perform said packing operation in response to said 
decode unit decoding a single vector store instruction. 

9. The computer as recited in claim 8, wherein said second 
bit size comprises twice the number of bits of said first bit 
size plus four bits, and wherein said microprocessor is 
configured to store data from the most significant bit 
locations of said register slot into said first memory location in 
response to said decode unit decoding said single vector load 
instruction. 

10. The computer as recited in claim 8, wherein said 
microprocessor comprises a register partitioned into slots 
each having said second bit size, wherein each one of said 
slots is partitioned into an upper half sub-slot and a lower 
half sub-slot, and wherein said microprocessor is configured 
so that data is dispatched from each said upper half sub-slot 
to a memory location having said first bit size in response to 
said decode unit decoding said single vector load 
instruction. 

11. The computer as recited in claim 10, wherein the 
upper half sub-slot dispatches more significant bits than the 
lower half sub-slot. 

12. The computer as recited in claim 8, wherein said 
packing operation occurs within the same instruction cycle 
as the vector store instruction. 

13. A microprocessor, comprising: 

an instruction cache configured to store coded first and 
second sets of instructions, wherein said first set of 
instructions comprises integer instructions for operating 
on integer operands and said second set of instructions 
comprises vector instructions for operating on 
vector data; 

a decode unit configured to decode said vector 
instructions; and 

wherein said microprocessor is configured to perform a 
load of data having a first bit size from a first memory 
location having said first bit size to a register slot 
having a second bit size, wherein said first bit size is 
smaller than said second bit size, and wherein said 
microprocessor is configured to perform an unpacking 
operation during the load to fill said register slot, 
wherein said microprocessor is configured to load said 
data and perform said unpacking operation in response 
to said decode unit decoding a single vector load 
instruction. 

14. The microprocessor as recited in claim 13, wherein 
said second bit size comprises twice the number of bits of 
said first bit size plus four bits. 

15. The microprocessor as recited in claim 15, further 
comprising a register partitioned into slots each having said 
second bit size, and wherein said microprocessor is configured 
to load the most significant bit locations of each slot 
with data from memory location having said first bit size in 
response to said decode unit decoding said single vector load 
instruction. 

16. The microprocessor as recited in claim 15, wherein 
each one of said slots is partitioned into an upper half 
sub-slot and a lower half sub-slot, and wherein said 
microprocessor is configured so that each said upper half sub-slot 
receives bits from a memory location having said first bit 
size and the lower half sub-slot receives a plurality of zero 
bits. 

17. The microprocessor as recited in claim 16, wherein the 
upper half sub-slot receives more significant bits than the 
lower half sub-slot. 

18. The microprocessor as recited in claim 13, wherein 
said unpacking operation occurs within the same instruction 
cycle as the vector instruction. 

19. The device as recited in claim 13, wherein said vector 
instructions comprise instructions for loading fractional 
signed or unsigned digital values into said register slot. 

20. The microprocessor as recited in claim 13 further 
comprising a register partitioned into slots each having said 
second bit size, and wherein said microprocessor is configured 
to load each slot with data from consecutive memory 
locations having said first bit size in response to said decode 
unit decoding said single vector load instruction. 

21. The microprocessor as recited in claim 13 further 
comprising a register partitioned into slots each having said 
second bit size, and wherein said microprocessor is configured 
to load half of said slots with data from consecutive 
memory locations having said first bit size and load the other 
half of said slots with zeros, in response to said decode unit 
decoding said single vector load instruction. 

22. The microprocessor as recited in claim 13 further 
comprising a register partitioned into slots each having said 
second bit size and each slot comprising an upper and a 
lower sub-slot, and wherein said microprocessor is configured 
to load each lower sub-slot with data from consecutive 
memory locations having said first bit size beginning at a 
first address and load each upper sub-slot with data from 
consecutive memory locations having said first bit size 
beginning at a second address offset from said first address 
by a number of locations corresponding to the number of 
slots in said register, in response to said decode unit 
decoding said single vector load instruction. 

23. The microprocessor as recited in claim 13 further 
comprising a register partitioned into slots each having said 
second bit size and each slot comprising two sub-slots, 
wherein loads to slots are treated as signed values and loads 
to sub-slots are treated as unsigned values. 

24. A microprocessor, comprising: 

an instruction cache configured to store coded first and 
second sets of instructions, wherein said first set of 
instructions comprises integer instructions for operating 
on integer operands and said second set of instructions 
comprises vector instructions for operating on 
vector data; 

a decode unit configured to decode said vector 
instructions; and 

wherein said microprocessor is configured to perform a 
store of data having a second bit size from a register 
slot having said second bit size to a first memory 
location having a first bit size, wherein said first bit size 
is smaller than said second bit size, and wherein said 
microprocessor is configured to perform a packing 
operation during the store on said data to fit said data 
into said first memory location, wherein said 
microprocessor is configured to store said data and perform 
said packing operation in response to said decode unit 
decoding a single vector store instruction. 

25. The microprocessor as recited in claim 24, wherein 
said second bit size comprises twice the number of bits of 
said first bit size plus four bits, and wherein said 
microprocessor is configured to store data from the most significant 
bit locations of said register slot into said first memory 
location in response to said decode unit decoding said single 
vector store instruction. 

26. The microprocessor as recited in claim 24, wherein 
said microprocessor comprises a register partitioned into 
slots each having said second bit size, wherein each one of 
said slots is partitioned into an upper half sub-slot and a 
lower half sub-slot, and wherein said microprocessor is 
configured so that data is dispatched from each said upper 
half sub-slot to a memory location having said first bit size 
in response to said decode unit decoding said single vector 
store instruction. 

27. The device as recited in claim 26, wherein the upper 
half sub-slot dispatches more significant bits than the lower 
half sub-slot. 

28. The microprocessor as recited in claim 24, wherein 
said packing operation occurs within the same instruction 
cycle as the vector store instruction. 

29. The microprocessor as recited in claim 24, further 
comprising a register partitioned into slots each having said 
second bit size, and wherein said microprocessor is configured 
to dispatch data from said slots to consecutive memory 
locations having said first bit size in response to said decode 
unit decoding said single vector store instruction. 

30. The microprocessor as recited in claim 24 further 
comprising a register partitioned into slots each having said 
second bit size, and wherein said microprocessor is configured 
to dispatch data from half of said slots to consecutive 
memory locations having said first bit size and not dispatch 
data from the other half of said slots, in response to said 
decode unit decoding said single vector store instruction. 

31. The microprocessor as recited in claim 24 further 
comprising a register partitioned into slots each having said 
second bit size and each slot comprising an upper and a 
lower sub-slot, and wherein said microprocessor is configured 
to dispatch data from said lower slots to consecutive 
memory locations having said first bit size and beginning at 
a first address, and dispatch data from said upper slots to 
consecutive memory locations having said first bit size and 
beginning at a second address offset from said first address 
by a number of locations corresponding to the number of 
slots in said register, in response to said decode unit 
decoding said single vector store instruction. 

***** 